Conclusion remarks - 發展一波數誤差最佳化有限元素 GPU 平行計算模型以求解不可壓縮 Navier-Stokes 方程式

The current finite element calculation of three-dimensional incompressible Navier-Stokes equations has been carried out in tri-quadratic/tri-linear elements. The interpolation func-tions for the primitive variables u and p satisfy the

LBB

compatability condition to cir-cumvent the oscillatory pressure solution. To enhance numerical stability in association with the predicted velocity, the stabilized term is added along the streamline direction. To minimize the wavenumber error for the convection term, a proper upwinding coefficient is rigorously chosen. Prior to the calculation of solution from finite element matrix equa-tions, which has been transformed to a SPD matrix through the normalization procedure of the original unsymmetric and indefinite mixed finite element equations. This procedure is essential to reduce the condition number of the normalized SPD matrix equations. Both of the Jacobi and polynomial pconditioners are investigated. To reduce the memory re-quirement, the element-by-element strategy is implemented in the iterative solver. The performance of three iterative solvers is also assessed. It is concluded that the use of CG iterative solver together with the pre-conditioner for solving the normal matrix equations is superior to the pre-conditioned GMRES and BICGSTAB iterative solvers for solving the original matrix equations.

Number of elements

Solver 4³ 8³ 16³

u 1.532×10⁻³ 8.241×10⁻⁴

–

Frontal v 8.973×10⁻³ 5.489×10⁻³

–

w 9.338×10⁻³ 6.044×10⁻³

–

p 3.194×10⁻² 2.069×10⁻²

–

u 1.460×10⁻³ 8.253×10⁻⁴ 4.191×10⁻⁴ BICGSTAB v 8.915×10⁻³ 5.535×10⁻³ 2.761×10⁻³ w 9.479×10⁻³ 6.004×10⁻³ 3.065×10⁻³ p 3.071×10⁻² 1.477×10⁻² 6.629×10⁻³ u 1.460×10⁻³ 8.253×10⁻⁴ 4.191×10⁻⁴ GMRES v 8.915×10⁻³ 5.535×10⁻³ 2.761×10⁻³ w 9.479×10⁻³ 6.004×10⁻³ 3.065×10⁻³ p 3.073×10⁻² 1.481×10⁻³ 6.695×10⁻³ u 1.460× 10⁻³ 8.254×10⁻⁴ 4.191×10⁻⁴ CG v 8.915× 10⁻³ 5.535×10⁻³ 2.761×10⁻³ w 9.479× 10⁻³ 6.004×10⁻³ 3.066×10⁻³ p 3.533× 10⁻² 1.481×10⁻² 6.661×10⁻³

Table 5.1: The computed L₂ error norms for the problem considered in Sec. 5.10.1. In this table, ”-” means that the solution is not computable.

Solvers Total CPU time (s) CPU time and percentage in ( ) for iterative solver

BICGSTAB 4,608.6 4,527.5 (98.24%)

GMRES 11,459.1 11,257.4 (98.23%)

CG 3,015.2 2,967.4 (98.71%)

Table 5.2: The total CPU times for three solvers in 16³tri-quadratic elements.

Solvers Reynolds CPU No. of outer No. of inner iterations number time (s) iterations at the n-th outer iteration

100 682.7 8 1,465 (n=4)

BICGSTAB 400 × × Break down

1000 × × Break down

100 300.2 6 430 (n=3)

GMRES 400 1,238.4 8 1,350 (n=4)

1000 2,623.3 9 2,120 (n=5)

100 795.4 8 2,073 (n=4)

CG 400 862.2 8 2,788 (n=4)

1000 1,937.7 10 5,463 (n=5)

Table 5.3: The finite element solutions computed from three iterative solvers for the problem considered in a domain of 21³ nodal points. The notation ”×” shown is this table means that the solution is not computable.

Solvers Reynolds CPU No. of outer No. of inner iterations number time (s) iterations at the n-th outer iteration

100 5,853.1 7 3,040 (n=3)

BICGSTAB 400 30,622.5 8 5,585 (n=4)

1000 × × Break down

100 4,652.1 6 850 (n=3)

GMRES 400 12,468.4 7 1,760 (n=4)

1000 × × Break down

100 4,384.9 6 3,460 (n=3)

CG 400 9,490.4 7 4,428 (n=4)

1000 29,845.3 11 7,611 (n=6)

Table 5.4: The finite element solutions computed from three iterative solvers for the problem considered in a domain of 41³ nodal points. The notation ”×” shown in this table means that the solution is not computable.

Solvers Reynolds CPU No. of outer No. of inner iterations number time (s) iterations at the n-th outer iteration

100 57,630.5 6 3,090 (n=3)

BICGSTAB 400 172,855.2 8 9,995 (n=4)

1000 × × Break down

100 71,598.2 6 2,840 (n=3)

GMRES 400 151,312.7 7 4,070 (n=4)

1000 × × Break down

100 46,097.8 6 4,281 (n=3)

CG 400 88,192.9 8 6,001 (n=4)

1000 272,705.7 12 10,657 (n=6)

Table 5.5: The finite element solutions computed from three iterative solvers for the problem investigated in a domain of 61³ nodal points. The notation ”×” shown in this table means that the solution is not computable.

Solvers Reynolds CPU No. of outer No. of inner iterations number time (s) iterations at the n-th outer iteration

100 294.6 6 1,400 (n=3)

PBICGSTAB 400 3,852.5 8 1,870 (n=4)

1000 × × Break down

100 254.1 6 300 (n=3)

PGMRES 400 1,033.5 8 930 (n=4)

1000 2,094.9 9 1,470 (n=5)

100 93.2 6 364 (n=3)

PJCG 400 362.5 8 1,230 (n=4)

1000 1,466.9 15 2,631 (n=8)

100 107.5 6 361 (n=3)

PPCG 400 425.4 8 1,229 (n=4)

1000 1,051.4 9 2,600 (n=5)

Table 5.6: The finite element solutions computed from three iterative solvers for the problem investigated in a domain of 21³ nodal points. The notation ”×” shown in this table means that the solution is not computable. PJCG : CG iterative solver used together with Jacobi pre-conditioner ; PPCG : CG iterative solver used together with polynomial pre-conditioner.

Solvers Reynolds CPU No. of outer No. of inner iterations number time (s) iterations at the n-th outer iteration

100 5,510.3 6 2,600 (n=3)

PBICGSTAB 400 10,297.1 8 4,870 (n=4)

1000 × × Break down

100 3,935.1 6 600 (n=3)

PGMRES 400 10,537.0 7 1,610 (n=4)

1000 × × Break down

100 1,918.3 6 986 (n=3)

PJCG 400 4,248.3 7 1,960 (n=4)

1000 13,167.7 9 5,017 (n=5)

100 2,711.5 6 978 (n=3)

PPCG 400 5,799.5 7 1,498 (n=4)

1000 12,230.8 9 3,301 (n=5)

Table 5.7: The finite element solutions computed from three iterative solvers for the problem investigated in a domain of 41³ nodal points. The notation ”×” shown in this table means that the solution is not computable.

Solvers Reynolds CPU No. of outer No. of inner iterations number time (s) iterations at the n-th outer iteration

100 49,805.3 7 5,455(n=4)

PBICGSTAB 400 98,022.3 8 6,615(n=4)

1000 × × Break down

100 36,839.9 6 930(n=3)

PGMRES 400 82,338.9 7 1,670(n=4)

1000 × × Break down

100 25,153.4 6 1,996 (n=3)

PJCG 400 46,770.4 10 2,149 (n=5)

1000 172,560.3 15 6,218 (n=7)

100 32,717.1 6 2,395 (n=3)

PPCG 400 51,790.1 7 2,960 (n=4)

1000 133,386.8 11 5,329 (n=5)

Table 5.8: The finite element solutions computed from three iterative solvers for the problem investigated in a domain of 61³ nodal points. The notation ”×” shown in this table means that the solution is not computable.

: u , v , w : u , v , w , p

Figure 5.1: Schematic of the primitive variable storing in a tri-quadratic element

Continuity eq.

Momentum eqs.

BC (Dirichlet type) Newton linearization

(a)

Continuity eq.

Momentum eqs.

BC (Dirichlet type) Newton linearization

diagonal zero

(b)

Figure 5.2: Illustration of a 402 × 402 finite element matrix equation derived in the 2³ tri-quadratic elements. The green square, red diamond, blue circle and black triangle symbols represent the non-zero entries contributed from the continuity equation, momen-tum equations, Dirichlet-type boundary data and Newton linearization, respectively. (a) Global matrix profile; (b) Matrix profile in the dashed block of matrix (a).

Element ID : 1

Figure 5.3: Illustration of the Dirichlet boundary condition implementation on boundary elementary matrix

Real part

Imaginarypart

-4 -2 0 2

-2 0 2

Eigenvalues of I - D^-1A^TA

Unit circle

Figure 5.4: Computed eigenvalues for the matrix equation I− D⁻¹A^TA

Scaling parameterω

Spectralradius

0 0.02 0.04 0.06 0.08 0.1

0.95 1 1.05

spectral radius 1

ω = 0.05

spectral radius of I-ωD^-1A^TA D = diag(A^TA)

Figure 5.5: Predicted spectral radius of I− ω D⁻¹A^TAversus the scaling parameter ω.

Real part

Imaginarypart

-3 -2 -1 0 1 2 3

-2 -1 0 1 2

Eigenvalues of I -ωD^-1A^TA

Unit circle

Max eigenvalue

= 0.999999894 < 1

ω = 0.05

Figure 5.6: Computed eigenvalues for the matrix equation I− ω D⁻¹A^TA

Scaling parameterω

Spectralradius

0 0.02 0.04 0.06 0.08 0.1

0 0.5 1 1.5 2

Spectral radius of I-ωD^-1A Spectral radius 1

D = diag(A)

− ω D⁻¹ ω.

Figure 5.8: Schematic of the three-dimensional lid-driven cavity problem considered in Sec. 5.10.2

Figure 5.9: Comparison of the predicted velocity profiles at the mid-plane y = 0.5. (a) Re = 400; (b) Re = 1000.

Iteration number

Residual

2000 4000

10^-11 10^-9 10^-7 10^-5 10^-3

10^-1 BICGSTAB

GMRES CG

(a)

Iteration number

Residual

1000 2000 3000 4000

10^-11 10^-9 10^-7 10^-5 10^-3 10^-1

PBICGSTAB PGMRES PJCG PPCG

(b)

Figure 5.10: The residual reduction plots in one inner iteration (at the fourth outer iter-ation) using the investigated iterative solvers to solve the incompressible Navier-Stokes

Iteration number

Residual

1000 2000 3000 4000 5000

10^-11 10^-10 10^-9 10^-8 10^-7 10^-6 10^-5

10^-4 PJCG

PPCG

Figure 5.11: The residual reduction plots in one inner iteration (at the fifth outer iter-ation) using the investigated pre-conditioned iterative solvers to solve the Navier-Stokes equations at Re = 1000 in 41³nodal points.

Chapter 6 Parallel computing on GPU

It is well known that the calculation of finite element solution for the incompressible Navier-Stokes equations is a computationally expensive task. In order to reduce the com-puting time, computers have been developed with many cores executing in parallel (clus-ters). However, these clusters are too expensive for most of the researchers.

In the past two decades, the Graphic Processing Units (GPUs) have experienced an amazing evolution and become the trend in high performance computing community.

From a simple device to generate and exhibit the 2D/3D graphics to a highly parallel, multithread, many-core device with enormous computational power and large memory bandwidth. Today, they have been widely applied to many computational areas due to the characteristic of massive multi-core parallelization, delivery of high throughput on double-precision arithmetic and larger memory bandwidth compared with the traditional CPU processor. This chapter introduces the GPU computing environment and the imple-mentation details of the developed finite element GPU code. Firstly, the short history of GPU and its hardware architecture are introduced. Secondly, the programming model for GPU will be introduced. Due to the fundamental design difference between the CPU and the GPU, some implementation details are introduced in order to obtain the finite element solution on GPU. For the best speedup purpose, some optimization techniques are also introduced. The developed GPU finite element code will be firstly verified by solving the problem amenable to analytical solution. The benchmark lid-driven cavity problem is then solved for the validation and performance analysis study.

6.1 Introduction

In the past decade, the microprocessors based on a single CPU (Central Processing Unit), such as those in the Intel Pentium family and the AMD Opteron family, have driven a rapid performance increase and a dramatic cost reduction in high performance comput-ing. Currently, these microprocessors have reached GFLOP/s and TFLOP/s floating point performance¹per second to the desktop and the cluster, respectively.

However, the development of CPU became slow since 2003 due to energy consump-tion and heat dissipaconsump-tion issues that limited the increase of clock rate and the level of procedure activities that can be performed in each clock cycle within a single CPU. To in-crease the computational power, multiple processing cores were placed on the same chip.

This hardware change has exerted a tremendous impact on the software developer com-munity. These multi-core processors are now very useful in general purpose computation but still lack appealing potential for high performance mathematical computations.

The GPUs were, unlike the CPUs, originally designed for rendering high resolution graphics, mainly for the computer game industry. Graphic rendering is in nature highly parallel, involving a large number of arithmetic operations on large data sets of pixels and vertices. The demand for real time, high resolution 2D/3D graphics, has led to an evolu-tion of GPUs into devices with enormous parallel computing capability [90, 91]. Several orders of magnitude faster than the fastest multi-core CPU have been realized. Fig. 6.1 depicts the growing gap in peak performance which is measured in FLOP/s between the CPUs and the GPUs over the last decade. It clearly reveals that NVIDIA’s GPUs have outperformed Intel CPUs quite substantially.

The memory bandwidth², which represents how fast the memory can be transfered between the CPU/GPU and the DRAM memory, is another important factor which can affect the performance of many computational tasks. Currently, GPUs have operated sev-eral times the memory bandwidth of CPUs shown in Fig. 6.2. The above facts reveal

1Floating point operations per second (FLOPS) is a measure of processor’s performance. 1 GFLOP/s = 10⁹FLOP/s, 1 TFLOP/s = 10¹²FLOP/s

2Bandwidth is an amount of data that can be transferred in a given amount of time

that GPUs are considered to be highly parallel multi-core processors having a tremen-dous floating point performance peak (GFLOP/s) and higher memory bandwidth (GB/s) compared with the CPU.

While the development of GPU hardware has evolved toward more unified proces-sors, it increasingly resembled high performance parallel computers. Some researchers took notice of its potential computing power for general purpose computing and they started trying to explore the use of GPU to solve the problems in science and engineering communities. The field of GPGPU (General Purpose Graphic Processing Unit) has arisen to be an important topic. The utilization of computational power of the GPU, however, does not come for free. Programmers need to pose their problems as graphic rendering tasks so that computations can be executed on GPUs via open library (e.g. OpenGL) or graphic runtime API (Application Programming Interface) routines. This results in a high learning threshold for programmers who are not familiar with the graphic programming.

In addition, the programming environment is very limited when it comes to the stage of debugging

The first breakthrough of GPGPU programming was achieved due to the introduction of Brook [92]. Brook, developed by researchers at Stanford University in 2004, provided the programmers with an extension to C-based programming language that allows access-ing GPUs for non-graphical computation. It keeps the programmers away from havaccess-ing a knowledge of graphic programming language, programs written for GPUs using Brook were able to run up to eight times faster than the similar code written for CPUs. Almost at the same time, another similar programming model called Sh, which was developed by researchers at Waterloo University [92], was presented as a library on top of C++. Like Brook, Sh also supports graphic programming and adopts a stream concept for general purpose programming.

In 2006, graphic programming took a leap forward when NVIDIA released its first fully programmable G80-series GPU processor, built on what is called the Tesla architec-ture, giving the birth of GPGPU computing to a wider public. Later on, a new

program-ming model called CUDA (Compute Unified Device Architecture) [91] was introduced by NVIDIA in 2007. CUDA is a general-purpose parallel computing architecture consists of its own programming and execute model. This new parallel architecture allows the use of NVIDIA’s GPU to solve many computationally intensive problems faster than with the CPU counterparts. It provides programmers with a set of runtimes API that allow parallel code to be executed only for NVIDIA’s GPUs without necessity of learning too much of the graphic programming language. Therefore, GPUs used in non-graphic applications have become an easy task. One of the NVIDIA’s goals in the creation of CUDA is to allow programmers with certain background in common programming language (C/C++, Fortran) to be able to access NVIDIA’s GPUs as much easy as possible.

The GPUs have found the way into today supercomputers, which increasingly use GPUs for accelerating calculation and reducing power consumption. The newest Top500 list presented in June 2015 showed that three supercomputers on Top10 are composed of CPUs and GPUs. Among them, the TITAN supercomputer in the Oak Ridge National Laboratory is now ranked No. 2 and it was built with both CPUs and GPUs. The increas-ing use of GPUs in high performance computincreas-ing industry has made Intel come up with a competing device, the Intel Xeon Phi accelerator [93]. The interest of major develop-ers such as Intel, NVIDIA, and AMD, in the many-core architecture computing industry, promises good future development in the high performance computational area.

在文檔中發展一波數誤差最佳化有限元素 GPU 平行計算模型以求解不可壓縮 Navier-Stokes 方程式 (頁 115-131)