
Chapter 2. The Parallel Computing of Finite Element Method for

2.2 Finite Element Method (FEM)

2.2.2 The Galerkin Weighted Residual Method

In FEM, we employ the Galerkin weighted residual method (GWRM) with three-dimensional C0-linear shape functions for a tetrahedral mesh and for a hexahedral mesh. In GWRM, the weighted average of the residual over the entire domain must vanish, and the trial function is the element shape function associated with each coefficient a_i. The representations of GWRM with the three-dimensional C0-linear shape function in global coordinates and in local coordinates are given in Eqs. (2-8a) and (2-8b), respectively.

∫∫∫ R(x, y, z; a) N_i(x, y, z) dx dy dz = 0   (2-8a)

∫∫∫ R(ξ, η, ρ; a) N_i(ξ, η, ρ) dx dy dz = 0   (2-8b)

where R(x, y, z; a) or R(ξ, η, ρ; a) is the residual of the governing equation.

2.3 Calculation of Electrostatic (ES) Field

Since the concept of GWRM has been introduced, this section begins with the derivation of the element equation of Poisson's equation for typical ES fields (Eqs. (2-5)). A trial solution is first constructed to approximate the exact electric potential, φ(r) ≈ Ũ(r; α), which can be written with the shape functions in global coordinates for a tetrahedral mesh as

Ũ(r; α) = Σ_j α_j N_j(x, y, z)   (2-9)

The local-coordinate shape function for a hexahedral mesh will be presented later. The major steps for formulating Eqs. (2-5) using GWRM with a three-dimensional tetrahedral mesh in global coordinates are described in detail as follows.

Step 1: Derive the typical element equation of Eqs. (2-5) using GWRM.

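If Eqs. (2-5) is taken to be the electrostatic Poisson equation ∇²φ = −ρ/ε₀ (an assumption here, since Eqs. (2-5) is defined earlier in the chapter and not reproduced in this section), the weighted-residual statement of Step 1 for one element would read roughly:

\[
\iiint_{V^{(e)}} \Big( \nabla^2 \tilde{U} + \frac{\rho}{\varepsilon_0} \Big) \, N_i^{(e)}(x, y, z) \, dx\, dy\, dz = 0 .
\]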

Step 2: Integrate by parts.


Eqs. (2-12) can be reformulated using the divergence theorem and may be written as

where τ_n^(e) is the outward-normal component of the flux. All load terms are placed on the RHS; this includes the interior load and the boundary fluxes.
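Under the same assumption about Eqs. (2-5), the integrated-by-parts element equation (Eqs. (2-13)) would take roughly the following form, with the interior load and the boundary flux collected on the RHS; the signs and material constants depend on the exact definition of Eqs. (2-5):

\[
\iiint_{V^{(e)}} \nabla N_i^{(e)} \cdot \nabla \tilde{U} \, dV
= \iiint_{V^{(e)}} \frac{\rho}{\varepsilon_0} \, N_i^{(e)} \, dV
+ \iint_{S^{(e)}} \tau_n^{(e)} \, N_i^{(e)} \, dS ,
\qquad \tau_n^{(e)} = \frac{\partial \tilde{U}}{\partial n} .
\]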

Step 3: Substitute the trial solution into element equations.

Inserting Eqs. (2-9) into Eqs. (2-13) yields,


Step 4: Substitute the 3-D C0-linear shape function for a tetrahedral mesh into element equation.

The LHS of Eqs. (2-14) may be reformulated into a closed-form expression in terms of the element geometry, where V is the volume of the cell and the subscripts i, k, l, m take the values 1, 2, 3, 4 (see Fig. 2.2) for N_1^(e)(x, y, z) and are permuted cyclically for N_2^(e)(x, y, z), N_3^(e)(x, y, z), and N_4^(e)(x, y, z).
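For reference, the C0-linear tetrahedral shape function and the resulting closed-form stiffness entry are quoted below as a textbook sketch; the coefficients a_i, b_i, c_i, d_i are the usual geometric coefficients and may differ from the thesis' own notation:

\[
N_i^{(e)}(x, y, z) = \frac{a_i + b_i x + c_i y + d_i z}{6V},
\qquad
\iiint_{V^{(e)}} \nabla N_i^{(e)} \cdot \nabla N_j^{(e)} \, dV
= \frac{b_i b_j + c_i c_j + d_i d_j}{36\, V} .
\]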

Step 5: Evaluate the interior load term and boundary flux term of Eqs. (2-14).

For coupling with the Particle-In-Cell method in a later chapter, we interpolate the charges from the particles to the nodes [Santi et al., 2003] through Eqs. (2-18), where the subscript k represents the charged-particle properties. Substituting Eqs. (2-18) into the interior load term yields the nodal load contributions.
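The charge-interpolation step can be pictured with the short sketch below; it assumes linear shape functions evaluated at each particle position, and the function and array names are illustrative rather than those of [Santi et al., 2003]:

import numpy as np

def scatter_charge_to_nodes(q_k, shape_at_particle, cell_nodes, n_nodes):
    """Accumulate particle charges onto mesh nodes using shape-function weights.

    q_k              : (n_particles,) particle charges
    shape_at_particle: (n_particles, 4) values of the four linear tetrahedral
                       shape functions evaluated at each particle position
    cell_nodes       : (n_particles, 4) global node indices of each host cell
    """
    rho_node = np.zeros(n_nodes)
    for q, N, nodes in zip(q_k, shape_at_particle, cell_nodes):
        rho_node[nodes] += q * N   # each node receives a shape-weighted share
    return rho_node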

Step 6: Assemble the element equations into a system equation.

Using an assembly procedure, the system equation is formed,

K_ij^(s) a_j = F_i^(s)   (2-22)

After these theoretical developments, we may apply the boundary conditions to Eqs. (2-22).
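A minimal sketch of the element-to-system assembly in Step 6 is given below. It assumes each tetrahedral element provides a 4x4 stiffness block and a length-4 load vector, and uses a dictionary-of-keys accumulation purely for illustration:

import numpy as np

def assemble_system(element_K, element_F, connectivity, n_nodes):
    """Assemble element matrices/vectors into the global system K a = F.

    element_K   : list of (4, 4) element stiffness matrices
    element_F   : list of (4,) element load vectors
    connectivity: list of length-4 tuples of global node indices per element
    """
    K = {}                      # sparse accumulation: (row, col) -> value
    F = np.zeros(n_nodes)
    for Ke, Fe, nodes in zip(element_K, element_F, connectivity):
        for a, i in enumerate(nodes):
            F[i] += Fe[a]
            for b, j in enumerate(nodes):
                K[(i, j)] = K.get((i, j), 0.0) + Ke[a, b]
    return K, F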

For a three-dimensional hexahedral mesh with the local-coordinate shape function, the same Steps 1-3 and 6 are repeated for Eqs. (2-5), and Steps 4-5 are modified as follows.

Step 4: Substitute the 3-D C0-linear shape function for a hexahedral mesh into element equation.

Eqs. (2-15) is transformed into a form appropriate for numerical evaluation.

The stiffness integral may now be written as a quadrature expression in which the integrand is evaluated at the Gauss points; n is the number of Gauss points.
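As a sketch of the intended quadrature form (the weights w and Gauss points (ξ_k, η_l, ρ_m) follow the usual Gauss-Legendre rule on the reference hexahedron; the thesis' exact symbols and equation numbering are not reproduced here):

\[
\iiint_{V^{(e)}} \nabla N_i \cdot \nabla N_j \, dV
\approx \sum_{k=1}^{n} \sum_{l=1}^{n} \sum_{m=1}^{n}
w_k w_l w_m \,
\big( \nabla N_i \cdot \nabla N_j \big)\big|_{(\xi_k, \eta_l, \rho_m)}
\, \big| J(\xi_k, \eta_l, \rho_m) \big| ,
\]

where |J| is the determinant of the Jacobian of the mapping from local to global coordinates.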

Here, the derivatives of the shape functions with respect to x, y, z are obtained from the derivatives with respect to ξ, η, ρ using the chain rule of differentiation through the Jacobian matrix of the coordinate transformation.
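The chain-rule relation has the standard form sketched below, written here with the inverse Jacobian; the thesis' own matrix notation is not reproduced:

\[
\begin{bmatrix} \partial N_i / \partial x \\ \partial N_i / \partial y \\ \partial N_i / \partial z \end{bmatrix}
= J^{-1}
\begin{bmatrix} \partial N_i / \partial \xi \\ \partial N_i / \partial \eta \\ \partial N_i / \partial \rho \end{bmatrix},
\qquad
J =
\begin{bmatrix}
\partial x/\partial \xi & \partial y/\partial \xi & \partial z/\partial \xi \\
\partial x/\partial \eta & \partial y/\partial \eta & \partial z/\partial \eta \\
\partial x/\partial \rho & \partial y/\partial \rho & \partial z/\partial \rho
\end{bmatrix} .
\]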

Step 5: Evaluate the interior load term and boundary flux term of Eqs. (2-24).

Again, for coupling with the Particle-In-Cell method in a later chapter, we interpolate the charges from the particles to the nodes, where the subscript k represents the charged-particle properties. The nodal value ρ_i^(e) is then weighted to the Gauss points as ρ^(e)(ξ_k, η_l, ρ_m) and substituted into the interior load integral in its quadrature form. The boundary flux integral is evaluated with the same quadrature expression.

2.4 Sparse Matrix Storage Schemes

The FEM formulations and assembly techniques typically lead to large, sparse matrices. Thus, it becomes essential to store the sparse matrix in some kind of compressed storage scheme, especially for an efficient matrix-by-vector product in the iterative method. Nearly all schemes have the following two storage components, e.g., [Golub and Van Loan, 1996] and [Saad, 2003]:

1. A one-dimensional array, called the primary array, with the size of the number of non-zero entries, which stores the values of the non-zero entries of the matrix.

2. Two one-dimensional arrays of integer identifiers, called secondary arrays, whose sizes are on the order of the number of non-zero entries, for identifying which entries of the matrix are stored in the primary array.

The Compressed Sparse Row (CSR) scheme is used for storing the current FEM sparse matrix; in this scheme, the non-zero entries of the primary array are stored row by row.
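As an illustration of the CSR layout and the matrix-by-vector product it enables, a minimal sketch (array names are illustrative) might look like:

import numpy as np

# CSR storage of a sparse matrix A:
#   values  - primary array of non-zero entries, stored row by row
#   col_idx - secondary array: column index of each stored entry
#   row_ptr - secondary array: position in `values` where each row starts
def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y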

2.5 Preconditioned Conjugate Gradient Method (PCG)

Among the iterative methods, the Preconditioned Conjugate Gradient (PCG) method is extremely effective for solving symmetric positive definite systems. The underlying conjugate gradient method was developed in 1952 by Hestenes and Stiefel as an improvement to the steepest descent method [Saad, 2003]. The steepest descent method approaches the solution asymptotically; however, its convergence may be very slow if the condition number of the matrix A is large. PCG is an efficient implementation of the conjugate directions method in which the search directions are constructed by conjugation of the residuals. In this section, the theory of PCG is not described in detail; it can be found elsewhere, e.g., [Golub and Van Loan, 1996], [Saad, 2003], and [Barrett and Berry, 1994]. The algorithm is given in the following.

Algorithm 1. Preconditioned Conjugate Gradient Method
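The algorithm listing can be pictured with the sketch below, a minimal Jacobi-preconditioned CG written with the same symbols described in the following paragraph; the stopping tolerance and function names are illustrative assumptions:

import numpy as np

def pcg(A, b, M_diag, tol=1e-8, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    M_diag: diagonal of A, used as the Jacobi preconditioner M.
    """
    x = np.zeros_like(b)
    r = b - A @ x                 # residual
    z = r / M_diag                # preconditioned residual, z = M^-1 r
    p = z.copy()                  # step direction
    rz = r @ z
    for k in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)     # step length
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = r / M_diag
        rz_new = r @ z
        beta = rz_new / rz        # correction factor
        p = z + beta * p
        rz = rz_new
    return x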

where k is the iteration number, r is the residual, x is the solution vector, p is the step direction, α is the step length, and β is the correction factor. The preconditioner M is defined as the diagonal of the stiffness matrix A, known as Jacobi preconditioning, which is equivalent to scaling the quadratic form along the coordinate axes.

2.6 Multi-frontal Massively Parallel Solver (MUMPS)

MUMPS [Amestoy et al., 2000] is a package for solving systems of linear equations of the form Ax = b. Unlike PCG, the stiffness matrix A may be a square sparse matrix that is unsymmetric, symmetric positive definite, or general symmetric. MUMPS uses a multi-frontal technique, which is a direct method based on either the LU or the LDL^T factorization of the matrix. In the following, the main features and steps of MUMPS, taken from its user guide, are given in turn.

The main features of the MUMPS package include the solution of the transposed system, input of the matrix in assembled format (distributed or centralized) or elemental format, error analysis, iterative refinement, scaling of the original matrix, return of a Schur complement matrix, and several built-in ordering algorithms. The details of this technique can be found in its user guide. The system Ax = b is solved in three main steps:

1. Analysis. The host performs an ordering based on the symmetrized pattern A+AT, and carries out symbolic factorization. A mapping of the multifrontal computational graph is then computed, and symbolic information is transferred from the host to the other processors. Using this information, the processors estimate the memory necessary for factorization and solution.

2. Factorization. The original matrix is first distributed to the processors that will participate in the numerical factorization. The numerical factorization on each frontal matrix is conducted by a master processor (determined by the analysis phase) and one or more slave processors (determined dynamically). Each processor allocates an array for contribution blocks and factors; the factors must be kept for the solution phase.

3. Solution. The right-hand side b is broadcast from the host to the other processors. These processors compute the solution x using the (distributed) factors computed during Step 2, and the solution is either assembled on the host or kept distributed on the processors.

Each of these phases can be called separately, and several instances of MUMPS can be handled simultaneously. MUMPS allows the host processor to participate in computations during the factorization and solve phases, just like any other processor. For both the symmetric and the unsymmetric algorithms used in the code, MUMPS has chosen a fully asynchronous approach with dynamic scheduling of the computational tasks. Asynchronous communication is used to enable overlapping between communication and computation. Dynamic scheduling was initially chosen to accommodate numerical pivoting in the factorization. The other important reason for this choice was that, with dynamic scheduling, the algorithm can adapt itself at execution time to remap work and data to more appropriate processors. In fact, MUMPS combines the main features of static and dynamic approaches: it uses the estimation obtained during the analysis to map some of the main computational tasks, while the other tasks are dynamically scheduled at execution time. The main data structures (the original matrix and the factors) are similarly partially mapped according to the analysis phase.

MUMPS distributes the work tasks among the processors, but an identified processor (the host) is required to perform most of the analysis phase, to distribute the incoming matrix to the other processors (slaves) in the case where the matrix is centralized, and to collect the solution.

2.7 Parallel Computing of FEM

In parallel computing of FEM, the computational domain is first partitioned into a number of non-overlapping sub-domains equal to the number of processors. One processor is assigned to the computation of each sub-domain, and communication between processors takes place whenever needed, e.g., [Farhat et al., 1995] and [Hodgson and Jimack, 1993]. Fig. 2.4 shows three different kinds of partitioning methods commonly used in graph-partitioning techniques [Saad, 2003]. We use element-based partitioning to partition the domain, in which no element is split between two sub-domains, e.g., [Gullerud and Dodds, 2001] and [Thiagarajan and Aravamuthan, 2002].

Different approaches are used for different types of domain decomposition [Saad, 2003]. When the domain is partitioned into a set of overlapping sub-domains, overlapping Schwarz methods are used for the solution of the system. On the other hand, iterative sub-structuring methods are used when the domain is partitioned into a set of non-overlapping sub-domains. There are two typical non-overlapping domain decomposition methods used in parallel computing of FEM: the subdomain-by-subdomain (SBS) method and the Schur complement method. We use the SBS approach for parallelizing FEM, in which the global stiffness matrix is divided into a number of partitioned matrices, each stored on its corresponding processor. The PCG method is then performed on the SBS basis. The details of the SBS method are described in the following.

Before introducing the SBS method, the graph-partitioning library METIS [Karypis and Kumar, 1995] is first used to decompose the computational domain Ω into p non-overlapping sub-domains Ω_i (see Fig. 2.5).

The following Eqs. (2-34) is the standard block-arrowhead structure of the stiffness matrix usually formed by the SBS method.
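Using the block names defined in the next paragraph, the block-arrowhead structure can be sketched as follows; the exact arrangement of the blocks in Eqs. (2-34) is assumed here rather than reproduced from the thesis:

\[
\begin{bmatrix}
A_1 & & & & B_1 \\
 & A_2 & & & B_2 \\
 & & \ddots & & \vdots \\
 & & & A_p & B_p \\
B_1^{T} & B_2^{T} & \cdots & B_p^{T} & A_s
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \\ x_s \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \\ b_s \end{bmatrix} .
\]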

This kind of matrix structure stems from the special node re-ordering on each sub-domain in the SBS method. For each sub-domain, the rule of node re-ordering is that the internal nodes are numbered first and the interfacial nodes last. The matrices associated with the internal nodes (A_i, B_i, B_i^T, x_i, and b_i) are contributed entirely by the corresponding sub-domain, while the interfacial matrices (A_s, B_s, B_s^T, x_s, and b_s) receive contributions from all sub-domains. Since all these matrices are assembled concurrently on each processor, the PCG algorithm using the SBS method is given as follows, e.g., [Saad, 2003], [Gullerud and Dodds, 2001], [Khan and Topping, 1996], and [Jimack and Touheed, 1997]:

Algorithm 2. Parallel Preconditioned Conjugate Gradient Method

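A minimal sketch of the distributed pieces that make this algorithm parallel is shown below, using mpi4py. The two communication kernels correspond to the inner product and update operations discussed next; all names are illustrative, and the interface handling is simplified (it assumes matching interface ordering between neighboring sub-domains):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def distributed_inner_product(u, v, interface_weight):
    """Inner product of two distributed vectors.

    interface_weight holds 1/(number of sub-domains sharing a node), so that
    interfacial entries, which are duplicated on several processors, are not
    counted more than once in the global sum.
    """
    local = np.sum(u * v * interface_weight)
    return comm.allreduce(local, op=MPI.SUM)   # single global communication

def update_interface(v, interface_ids, neighbor_ranks):
    """Sum the distributed contributions of the interfacial nodes.

    Exchanges the interfacial entries with neighboring sub-domains and adds
    the received contributions (point-to-point, local communication).
    """
    send = v[interface_ids]
    for rank in neighbor_ranks:
        recv = comm.sendrecv(send, dest=rank, source=rank)
        v[interface_ids] += recv
    return v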

This algorithm shows that the PCG with the SBS method is performed concurrently on each processor, whereas communication is required in two matrix operators: the inner product and the update [Jimack and Touheed, 1997]. The subroutine inner product calculates the inner product of two distributed vectors across processors and requires a single global communication to provide each processor with a copy of the sum. Since the interfacial entries enter this sum repeatedly, they should be scaled by the reciprocal of the number of sharing processors. The subroutine update sums the distributed contributions of the interfacial nodes via local communication.

The non-overlapping domain decomposition technique utilized in the parallel computing of FEM is briefly described in the next section.

2.7.1 Domain Decomposition Method

There are two typical domain decomposition methods: geometry-based and graph-based domain decomposition. Since the geometry-based method gives a poor edge cut (Ec) and poor load balance, e.g., [Tseng, 2005], [Wehage and Haug, 1982], and [Simon, 1991], we use the graph-based method for our domain decomposition. In graph terms, a sketch of a two-dimensional triangular mesh (graph) is shown in Fig. 2.6; it shows that a graph is the collection of the vertices of the cells and the edges between the cells. Most of these methods were developed by computer scientists, and the main idea is to subdivide the n vertices among the NP sub-domains while minimizing the number of edge cuts and balancing the weight in each sub-domain.
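The two objectives, minimizing edge cuts and balancing the weight per sub-domain, can be made concrete with the toy sketch below; the graph representation and names are illustrative:

import numpy as np

def partition_quality(edges, weights, part, n_parts):
    """Evaluate a graph partition.

    edges  : list of (u, v) pairs, one per graph edge
    weights: per-vertex weights (e.g., computational cost of each cell)
    part   : per-vertex sub-domain index in [0, n_parts)
    Returns (edge_cut, load_imbalance).
    """
    edge_cut = sum(1 for u, v in edges if part[u] != part[v])
    loads = np.zeros(n_parts)
    for w, p in zip(weights, part):
        loads[p] += w
    imbalance = loads.max() / loads.mean()   # 1.0 means perfectly balanced
    return edge_cut, imbalance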

Tseng [Tseng, 2005] reviewed graph-based partitioning methods, e.g., [Simon, 1991], [Vanderstraeten et al., 1996], and [Barnard and Simon, 1994], including Greedy partitioning, recursive spectral bisection, the multi-level scheme, and the two-step method. He recommended the graph-partitioning library METIS, developed at the University of Minnesota using a multi-level scheme, which has impressive performance in terms of CPU time and is very easy to implement.

After obtaining the partitioned data, a converter is designed for preprocessing the mesh (grid) data. The main function of this converter is to reorder the fully unstructured mesh data into globally sequential but locally unstructured mesh data, so that the relationship between global and local cell data can be obtained by a simple arithmetic operation owing to this special cell-numbering design [Tseng, 2005]. In addition, the node numbers of each sub-domain must be re-ordered following the requirement of FEM using the SBS method. This converter is an improvement of Tseng's method.
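The "simple arithmetic operation" can be illustrated as follows, assuming cells are renumbered contiguously per sub-domain with a known offset (a hypothetical layout, used only to show the idea):

# cell_offset[p] = global index of the first cell owned by sub-domain p
def global_to_local(global_cell, cell_offset, p):
    """Local cell index within sub-domain p under the contiguous numbering."""
    return global_cell - cell_offset[p]

def local_to_global(local_cell, cell_offset, p):
    return cell_offset[p] + local_cell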

2.8 Parallel Adaptive Mesh Refinement using a Tetrahedral Mesh (PAMR)

Fig. 2.7 shows the proposed overall procedure of parallel adaptive mesh refinement for an unstructured tetrahedral mesh. Only the general procedures are described in this thesis, while the details and results of the parallel implementation can be found elsewhere [Lian et al., 2006]. Basically, the parallel mesh refinement procedures in Fig. 2.7 are similar to those presented earlier for serial mesh refinement [Wu et al., 2004]. In the serial mesh refinement, the cells are first examined to identify whether cell refinement is necessary. If so, they are refined "isotropically" into eight child cells. The generated hanging nodes are then removed following the procedures proposed by Wu et al. [Wu et al., 2004], in which the cells are further refined into two, four, or eight child cells.

However, the detailed procedures and related data structure become more complicated than those in serial mesh refinement because of the parallel processing.

Domain decomposition is also used in line with the parallel implementation of the current Poisson's equation solver. In practice, each spatial sub-domain belongs to a specific processor. The overall procedure shown in Fig. 2.7 can be summarized as follows:

1. Preprocess the input data at the host processor, and distribute them to all other processors.

2. Index the cells that require refinement based on the refinement criteria. In the current study, we use the variation of potentials among elements as the criterion for cell refinement, which, in practice, is equivalent to a generally accepted error estimator, as will be shown in the next section (a small sketch of this flagging step is given after this list).

3. Check if further mesh refinement is necessary. If it is, then proceed to the next step. If not, proceed to Step 9.

4. Add new nodes into those cells that require refinement.

4a. Add new nodes onto all edges of isotropic cells.

4b. Add new nodes into the anisotropic cells which require further refinement as decided upon in the following steps.

4c. Communicate the hanging-node data to the corresponding neighboring processors if the hanging nodes are located at the inter-processor boundaries (IPBs).

4d. Remove the hanging nodes following the procedures shown in Wu et al. [Wu et al., 2004]. The basic idea is to remove the hanging nodes for all kinds of conditions and then refine the cell into two, four, or eight child cells.

5. Unify the global node and cell numberings caused by the newly added nodes in all processors.

5a. Add up the number of the newly added nodes in each processor, excluding those located at the IPBs.

5b. Gather this number from all other processors, and add them up to obtain the updated total number of nodes, including old and new nodes, but excluding the newly added nodes at IPBs.

5c. Build up the updated node-mapping and corresponding cell-mapping arrays for those newly added nodes in the interior part of each sub-domain based on the results in Step 5b.

5d. Communicate the data of newly added nodes at the IPBs among all processors.

5e. Build up the node-mapping array for the new nodes received at IPBs in each processor.

6. Build up new connectivity data for all cells to include the newly added nodes.

7. Build up the new neighbor-identifying array based on the new connectivity

data obtained in Step 6.

7a. Reset the neighbor-identifying array.

7b. Build up the neighbor-identifying arrays for all cells based on the new connectivity data, excluding the data associated with faces lying on the IPBs, which require updated global cell numbers that are not yet known at this stage.

7c. Record all the neighbor-identifying arrays that have not been rebuilt in Step 7b.

7d. Broadcast all the recorded data in all processors.

7e. Build up the neighbor-identifying arrays on the IPBs, considering the overall connectivity data structure.

8. Decide whether the preset maximum number of refinements has been reached. If it has, proceed to the next step. Otherwise, return to Step 3.

9. Synchronize all processors.

10. The host processor gathers and outputs the data.
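As referenced in Step 2, a sketch of the cell-flagging step might look like the following; the potential-variation measure and the threshold name are illustrative assumptions rather than the thesis' exact error estimator:

import numpy as np

def flag_cells_for_refinement(phi_node, connectivity, threshold):
    """Flag cells whose nodal potential variation exceeds a threshold.

    phi_node    : (n_nodes,) nodal electric potentials
    connectivity: (n_cells, 4) node indices of each tetrahedral cell
    """
    phi_cell = phi_node[connectivity]                     # (n_cells, 4)
    variation = phi_cell.max(axis=1) - phi_cell.min(axis=1)
    return variation > threshold                          # boolean flags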

In the current study, by coupling the PAMR with the parallel Poisson's equation solver as stated in Step 3, the maximum number of refinements is set to "one", since whether further refinement is necessary is decided outside the PAMR, as can be seen in the next section.

2.9 Coupling of PAMR with Parallel Electrostatic Field Solver

The PAMR presented in the previous section can be easily coupled to the current parallel electrostatic field solver, since both utilize a 3-D unstructured tetrahedral mesh and MPI for data communication. One can readily wrap up the PAMR as a library and insert it into the source code of any parallel numerical solver to be used. However, some problems may occur due to memory conflicts between the inserted library and the numerical solver itself, which could reduce the problem size one can handle in practice. As such, a simple coupling procedure, written in shell script (Fig. 2.8) that is standard on all Unix-like systems, can be prepared to link the PAMR and the current parallel electrostatic field solver. In doing so, we can keep the source codes intact, without alterations. Indeed, this is especially justified if only a steady state of the physical problem is sought, in which case only a few mesh refinements are normally needed to obtain a fairly satisfactory solution. Thus, the total I/O time, which is proportional to the number of couplings between the two codes, can be reduced to a minimum in practical applications. In addition, as shown in Fig. 2.8, after identifying those cells that require refinement before PAMR, the domain is repartitioned based on the new mesh refinement requirements. For example, the weight factors of the cells (vertices in graph theory) are set to eight for those cells which are flagged to be refined; otherwise, they are set to unity. With this distribution

of weight factors as the input to ParMETIS [Karypis, 1998], an approximate (but
