Implementation of iterative solvers on GPU

One of the applications that might be benefitted from the enormous computational power of modern GPU is the application of iterative solvers to solve a large system of linear equations. The main objective of this chapter is to solve the finite element equations discretized from the incompressible Navier-Stokes equations on GPU. Special attention was devoted to the sparse matrix-vector product (SpMV) operation due to its importance in applying an iterative solver [98–100]. A good summary on the development of high performance SpMV implementation on recent multicore and accelerator-based hardware can be found in [101]. Some iterative solvers implemented on GPU architecture have been reported in [102–104].

In Chapter 5, the efficiency of using the pre-conditioned conjugate gradient (PCG) solver for solving the normalization matrix equations has been demonstrated. Now, the current PCG solver will be implemented on GPU platform to obtain convergent solution even quickly. The PCG algorithm is described as follows :

Starting from an initial guess x₀

Mrepresents the Jacobi pre-conditioner

Compute x^′₀=∑eAex0, x^′′₀=∑e(A_e)^Tx^′₀, b^′=∑e(A_e)^Tb //matrix-vector product ; Compute the initial residual r₀= b^′− x^′′₀ // global update ;

Solve M z0= r0 // pre-conditioning ;

p₀= z₀

For j = 1, 2, . . . , p^′_j−1=∑e(A_ep

j−1) // matrix-vector product ; p^′′_j₋₁=∑e[(A_e)^Tp^′_j−1] // matrix-vector product ; αj−1= (p

j−1, r_j−1)/(p

j−1, p^′′_j−1) // inner product ; x_j= x_j₋₁+αj−1p_j−1 // global update ; r_j= r_j−1− αj−1p^′′_j₋₁ // global update ; Convergence check

Solve M z_j= r_j // pre-conditioning ; βj−1= (z_j, r_j)/(r_j−1, r_j−1) // inner product ; pj= z_j+βj−1p_j−1 // global upadte ; End

The main operations of PCG solver consist of (i) global update operation ; (ii) pre-conditioning operation ; (iii) inner-product operation ; and (iv) matrix-vector product operation. The main computational cost is the matrix-vector product operation (iV), in particular in large-sized problem, while the cost of other operations is relatively small.

The operations (i) and (ii) are naturally parallelized owing to the chosen Jacobi pre-conditioner. The implementations of operations (iii) and (iv) will be described in detail.

6.5.1 Inner product operation on GPU

Calculation of the inner product operation (iii) on GPU is implemented in two steps.

Firstly, the temporal vector c with size N is constructed to store the vector-vector product c_i = a_ib_i (i = 1, ..., N) of vectors a and b in GPU. Secondly, all the components in c are summed up to get the inner product (a, b) =∑^Ni=1c_i. The above procedures can be completed with two kernel functions: One is used for the vector-vector product and the other one is used for the calculation of the sum.

6.5.2 Matrix-vector product operation on GPU

It is proven that the SpMV operation is regarded as the most computationally intensive part in most of the Krylov subspace iterative solvers. Great efforts have been made to improve the performance of parallel SpMV implementations. Several research groups have reported their implementations on high-performance SpMV on GPU [98–100]. A good summary on the development of high performance SpMV implementation on recent multi-core and accelerator-based hardware can be found in [101]. In these literatures, the global matrix is, however, usually stored in some specific sparse matrix formats. The choice of sparse matrix format essentially depends on the feature of the matrix and the number of non-zero entries. The survey of sparse matrix can be found in the literature [72]. Among them, the CSR format is the most common one and has been frequently used.

As mentioned already, the use of sparse matrix format still encounters memory inten-sive requirement problem because the number of non-zero entries increases significantly as the problem size increases. In this study, we aimed to compute the matrix-vector prod-uct without using any sparse matrix format. To this end, the EBE technique will be im-plemented on GPU. Kiss et al. [105] also presented the similar technique, but the detailed algorithm is not given therein.

In EBE context, the global matrix-vector A x can be decomposed into the sum of the element-level matrix-vector (EMV) A^ex^e

A x =

Nel

∑

e=1

(B^e)^T( A^ex^e)

(6.1) where B^e denotes the Boolean matrix, which maps the entries of e-th element matrix A^e into a global matrix A. Calculation of the global matrix-vector product on GPU via Eq. (6.1) is, therefore, a challenging task.

In the developed EMV kernel function (Algorithm 6.1), each block with nethreads (ne

being the local degree of freedom) is responsible for computing the product of one row

one element matrix-vector product. The above algorithm is described in Algorithm 6.1 Algorithm 6.1:EMV kernel function

1 A^e(n_e, n_e) : The e-th elementary matrix element matrix ;

The Algorithm 6.1 may, however, result in an incorrect result due to the presence of the so-called race condition [105]. Race condition occurs in GPU computation if any two threads try to read or write the same GPU memory location simultaneously. Such a read or write leads to information loss and incorrect result. Since in the CUDA execution no information is given about which thread performs the specific task, the fastest thread will perform the task.

To avoid race condition problem, one can use the atomic update or the element color-ing technique [105]. The former refers to protect the memory space durcolor-ing I/O causcolor-ing other threads to access the same memory space to wait until the operation is fully com-pleted. The latter refers to partition elements into a finite disjoint element subsets marked with different colors such that any two elements in a given subset are not allowed to share the same global node, as illustrated in Figs. 6.6-6.7. Different colors are displaced in pro-cess serially, while elements within the subset marked with the same color are invoked in parallel. In this study, the element coloring strategy shown in Algorithm 6.2 is adopted because it is effective and run only once during the 2D/3D mesh generation.

The Algorithm 6.2 is very simple, but it may generate color distributions that are usually uneven. We therefore add a second step to balance the number of elements inside

Algorithm 6.2:2D/3D Mesh coloring algorithm

1 Nel : The number of elements ;

E

: A collection of elements ;

E

^e : The e-th element ;

4 Color(Nel) : The color ID of each element ;

5 for e=1→ Nel do

6 Cused(1) = 1 , . . . , Cused(Nel)=Nel // Initialize the Cused array ;

7 for k = 1→ Nel-1 do

8 check the all neighbors of e-th element

E

^k∈

E

^e ;

9 if (

E

^kis colored) then

10 Cused(Color(k)) = Nel+1

11 end

12 end

13 Color(e) = Min{Cused} // mark the element with color ;

14 end

each mesh subset of a given color, as stated in [106].

在文檔中發展一波數誤差最佳化有限元素 GPU 平行計算模型以求解不可壓縮 Navier-Stokes 方程式 (頁 138-142)