Development of a wavenumber-error-optimized finite element GPU parallel computing model for solving the incompressible Navier-Stokes equations

6.9 Performance analysis

6.9.1 Introduction

This section compares the performance of the developed CPU and GPU codes in order to quantify the benefit of running the program on a GPU. The benchmark lid-driven cavity problem investigated earlier is solved at different grid sizes to assess the speedup.

Since GPUs are well suited to parallel algorithms, especially those with high data parallelism, the performance analysis in this section is based on the degree of grid refinement.

As the grid is refined, the computational load and the computing time increase significantly. To analyze the performance, the most important routines in the developed code are summarized in Table 6.5. The red triangle symbol indicates that the routine is executed entirely on the CPU, whereas the green circle symbol indicates that the routine is executed entirely on the GPU. Note that the hybrid CPU/GPU platform requires three additional routines in comparison with its CPU counterpart. The nomenclature of this table is used below to report the times measured for the CPU and GPU codes.

Following the nomenclature given to each routine, a short description of some routines is given below.

• Data preparation

This routine loads the domain coordinates, the mesh connectivity information and the boundary condition data. In addition, the values of the interpolation functions, the derivatives of the interpolation functions and the weighting factors associated with all Gaussian quadrature points are evaluated. Moreover, the Boolean matrices for each element are constructed in this routine.

• Element coloring

The coloring routine (Algorithm 6.2) runs on the CPU since it is efficient and is executed only once. Note that this routine is executed only on the hybrid CPU/GPU platform.

• Initial condition

The steady-state solutions at low Reynolds number are used as the initial conditions for all the problems at high Reynolds number.
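The purpose of the element coloring routine can be illustrated with a minimal sketch. Algorithm 6.2 is not reproduced in this section, so the greedy strategy below is an assumption chosen for brevity, not the thesis implementation; the function names and the small mesh of 4-node quadrilaterals are hypothetical. The sketch only demonstrates the property that matters: elements of the same color never share a global node, so they can be processed concurrently without write conflicts.

```python
from collections import defaultdict


def color_elements(connectivity):
    """Greedy coloring: elements that share a global node get different colors."""
    # Map every global node to the elements that contain it.
    node_to_elems = defaultdict(list)
    for e, nodes in enumerate(connectivity):
        for n in nodes:
            node_to_elems[n].append(e)

    colors = [-1] * len(connectivity)  # -1 means "not colored yet"
    for e, nodes in enumerate(connectivity):
        # Colors already taken by elements touching any node of element e.
        used = {colors[nb] for n in nodes for nb in node_to_elems[n]
                if colors[nb] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors


def structured_quads(nx, ny):
    """Connectivity of an nx-by-ny mesh of 4-node quadrilaterals (test mesh)."""
    def node(i, j):
        return j * (nx + 1) + i
    return [(node(i, j), node(i + 1, j), node(i + 1, j + 1), node(i, j + 1))
            for j in range(ny) for i in range(nx)]


mesh = structured_quads(4, 4)
colors = color_elements(mesh)
for row in range(4):
    print(colors[4 * row:4 * row + 4])
```

On this structured 2D mesh the greedy pass produces four color groups, in line with the four-group strategy illustrated in Fig. 6.6.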

Time measurements of each routine in the CPU and GPU codes are obtained via the command clock system(). Note that in the GPU code the command Syncthreads() must precede the time measurement command in order to guarantee that all GPU threads have completed their computations.
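Why the synchronization must come before the clock is read can be shown with a small analogy. The sketch below is not GPU code: a Python worker thread stands in for an asynchronously launched kernel, and `join()` stands in for the synchronization command. All names are hypothetical.

```python
import threading
import time


def launch_async(work_seconds):
    """Stand-in for an asynchronous kernel launch: returns immediately."""
    worker = threading.Thread(target=time.sleep, args=(work_seconds,))
    worker.start()
    return worker


def timed(work_seconds, synchronize):
    """Measure the elapsed time with or without waiting for the worker."""
    start = time.perf_counter()
    handle = launch_async(work_seconds)
    if synchronize:
        handle.join()  # wait for the "device" before reading the clock
    elapsed = time.perf_counter() - start
    handle.join()      # always wait before returning, so no work is left running
    return elapsed


t_unsync = timed(0.2, synchronize=False)
t_sync = timed(0.2, synchronize=True)
print(f"without synchronization: {t_unsync:.3f} s")
print(f"with synchronization:    {t_sync:.3f} s")
```

Without the synchronization, the measured time covers only the cost of launching the work, not the work itself, so the routine would appear far cheaper than it is.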

The two- and three-dimensional lid-driven cavity flow problems at two different Reynolds numbers are used to assess the performance of the CPU and hybrid CPU/GPU platforms. The grid sizes used in the performance analysis are listed in Tables 6.6 and 6.7. The nomenclature introduced in Table 6.5 is used to indicate the most time-consuming routines and how they scale with mesh refinement.

The first task in the performance analysis is to justify the need for the GPU platform. For all the considered cases, the computation time and the relative time of each routine executed on the CPU platform are tabulated in Tables 6.8-6.11. One can clearly see that the SOLV routine is by far the most time-consuming, accounting for more than 99% of the total computation time. The computation time of the other routines is relatively small and can be neglected.

It is well known that good parallel performance is achieved if the most computationally intensive part is accelerated. This justifies the need for a complete implementation of the iterative solver on the GPU in order to take advantage of the available GPU computing power. The computation time of each routine executed on the hybrid CPU/GPU platform is tabulated in Tables 6.12-6.15. The total computation times on the two platforms and the resulting speedup ratios are summarized in Tables 6.16-6.19. The speedup indicates how much faster the hybrid CPU/GPU code runs than its CPU counterpart. The speedup ratios for all considered cases are plotted in Fig. 6.13.
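The argument for accelerating the solver first can be made quantitative with Amdahl's law, which is brought in here as an illustration and is not part of the original analysis. The sketch below uses the CPU timings of mesh M1 from Table 6.8; the function name is hypothetical.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when a fraction of the run time is sped up by `factor`."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


# CPU timings in seconds for mesh M1, 2D cavity, Re = 1000 (Table 6.8).
total = 1365.057
solv = 1362.064
matx = 2.037

p_solv = solv / total
p_matx = matx / total

# Upper bounds on the overall speedup (acceleration factor -> infinity).
bound_solv = 1.0 / (1.0 - p_solv)
bound_matx = 1.0 / (1.0 - p_matx)

print(f"SOLV share of run time: {100 * p_solv:.3f}%")
print(f"bound when only SOLV is accelerated: {bound_solv:.1f}x")
print(f"bound when only MATX is accelerated: {bound_matx:.4f}x")
for factor in (2, 5, 10, 100):
    print(f"SOLV {factor:>3}x faster -> overall {amdahl_speedup(p_solv, factor):.2f}x")
```

Because SOLV accounts for nearly all of the run time, accelerating it alone leaves ample headroom, whereas accelerating any other single routine can change the total time by only a fraction of a percent.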

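The speedup increases with increasing grid size and Reynolds number: it rises from about 5.3 to 6.1 for the 2D problems and from about 10.9 to 18.0 for the 3D problems (Tables 6.16-6.19). In other words, the larger the computational workload, the higher the speedup ratio.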
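The speedup ratio is simply the total CPU time divided by the total hybrid CPU/GPU time. As a short worked example, the sketch below recomputes the ratios from the values listed in Table 6.16 (2D cavity, Re = 1000); the variable names are illustrative.

```python
meshes = ["M1", "M2", "M3", "M4", "M5"]

# Total computation times in seconds, as listed in Table 6.16.
cpu_time = [1365.057, 2850.348, 5320.078, 14173.739, 32047.025]
hybrid_time = [256.864, 515.466, 903.655, 2368.717, 5341.193]

# Speedup = CPU time / hybrid CPU/GPU time.
speedup = [c / h for c, h in zip(cpu_time, hybrid_time)]

for name, s in zip(meshes, speedup):
    print(f"{name}: {s:.3f}")
```

The printed ratios reproduce the speedup row of Table 6.16 and increase monotonically from M1 to M5.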

                         CPU                      GPU1                        GPU2
Processor                Intel i7-4820K           NVIDIA Kepler K20           NVIDIA Kepler K40
Clock rate               3.7 GHz                  705 MHz                     745 MHz
Number of cores          8                        2496 (SP), 832 (DP)         2880 (SP), 960 (DP)
Off-chip memory          32 GB (DDR3)             5 GB (GDDR5)                12 GB (GDDR5)
On-chip memory           L1-cache: 256 KB/core    Constant memory: 64 KB      Constant memory: 64 KB
                         L2-cache: 1024 KB/core   Shared memory: 48 KB/block  Shared memory: 48 KB/block
                         L3-cache: 10 MB/core     Register: 64 KB/block       Register: 64 KB/block
Peak performance         59.2 GFLOPS (DP)         3.52 TFLOPS (SP)            4.29 TFLOPS (SP)
                                                  1.17 TFLOPS (DP)            1.43 TFLOPS (DP)
Memory bandwidth         59.7 GB/s                208 GB/s                    288 GB/s
IEEE 754 single/double   yes/yes                  yes/yes                     yes/yes

SP: single precision; DP: double precision

Table 6.1: Some key specifications of the considered CPU processor and the NVIDIA K20 and K40 GPU cards

Memory     Location   Cached   Device access   Scope                  Lifetime
Shared     On-chip    N/A      R/W             All threads in block   Thread block
Local      DRAM       Yes      R/W             One thread             Thread
Global     DRAM       Yes      R/W             All threads and host   Program
Constant   DRAM       Yes      R               All threads and host   Program
Texture    DRAM       Yes      R               All threads and host   Program

Table 6.2: Some features of different GPU device memory types

Grid size   Platform   ||u − u_exact||₂   ||v − v_exact||₂   ||p − p_exact||₂
21²         CPU        6.182×10⁻⁵         5.708×10⁻⁶         2.785×10⁻⁵
            Hybrid     6.182×10⁻⁵         5.708×10⁻⁶         2.785×10⁻⁵
41²         CPU        5.656×10⁻⁶         4.392×10⁻⁷         4.464×10⁻⁶
            Hybrid     5.656×10⁻⁶         4.392×10⁻⁷         4.464×10⁻⁶
61²         CPU        1.630×10⁻⁶         1.494×10⁻⁷         1.949×10⁻⁶
            Hybrid     1.630×10⁻⁶         1.495×10⁻⁷         1.950×10⁻⁶
81²         CPU        7.183×10⁻⁷         9.592×10⁻⁸         1.189×10⁻⁶
            Hybrid     7.183×10⁻⁷         9.591×10⁻⁸         1.189×10⁻⁶

Note: Hybrid denotes the hybrid CPU/GPU platform.

Table 6.3: The computed L₂ error norms obtained on different grids for the 2D verification problem considered in Sec. 6.8.1.

Grid size   Platform   ||u − u_exact||₂   ||v − v_exact||₂   ||w − w_exact||₂   ||p − p_exact||₂
21³         CPU        2.450×10⁻⁴         1.739×10⁻³         1.998×10⁻³         9.945×10⁻³
            Hybrid     2.450×10⁻⁴         1.739×10⁻³         1.997×10⁻³         9.943×10⁻³
41³         CPU        1.451×10⁻⁴         9.783×10⁻⁴         1.170×10⁻³         4.936×10⁻³
            Hybrid     1.451×10⁻⁴         9.783×10⁻⁴         1.170×10⁻³         4.935×10⁻³
61³         CPU        1.033×10⁻⁴         6.791×10⁻⁴         8.248×10⁻⁴         3.305×10⁻³
            Hybrid     1.032×10⁻⁴         6.791×10⁻⁴         8.247×10⁻⁴         3.303×10⁻³
81³         CPU        8.000×10⁻⁵         5.196×10⁻⁴         6.354×10⁻⁴         2.480×10⁻³
            Hybrid     8.000×10⁻⁵         5.196×10⁻⁴         6.365×10⁻⁴         2.481×10⁻³

Table 6.4: The computed L₂ error norms obtained on different grids for the 3D verification problem.

Routine                     Nomenclature   CPU   CPU/GPU
Data preparation            DATA           ▲     ▲
Element coloring            ECOL           –     ▲
Host to device              HTOD           –     ▲
Compute all A^e             MATX           ▲     ▲
Compute A^T b               NORB           ▲     ●
Compute M                   PREC           ▲     ●
Solve A^T A x = A^T b       SOLV           ▲     ●
Device to host              DTOH           –     ▲

▲: executed on the CPU; ●: executed on the GPU; –: not required

M: Jacobi pre-conditioner

Table 6.5: Summary of the most important routines on the implementation platforms
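To make the roles of NORB, PREC and SOLV concrete, the sketch below solves the normal equations A^T A x = A^T b with a Jacobi-preconditioned conjugate gradient iteration. This is an illustration under stated assumptions: this section does not name the Krylov method used in SOLV, so the conjugate gradient method is assumed here, and the dense 3×3 matrix, tolerance and function names are hypothetical. The thesis code works with element matrices on the GPU, which this plain Python version does not attempt to mimic.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg_normal_equations(A, b, tol=1e-10, max_iter=200):
    """Jacobi-preconditioned CG applied to A^T A x = A^T b (assumed solver)."""
    At = transpose(A)
    rhs = matvec(At, b)  # NORB: right-hand side A^T b
    # PREC: diag(A^T A) equals the squared column norms of A.
    m_inv = [1.0 / dot(col, col) for col in At]
    x = [0.0] * len(rhs)
    r = rhs[:]  # residual for the zero initial guess
    z = [mi * ri for mi, ri in zip(m_inv, r)]
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        q = matvec(At, matvec(A, p))  # (A^T A) p without forming A^T A
        alpha = rz / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        z = [mi * ri for mi, ri in zip(m_inv, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x_true = [1.0, -2.0, 3.0]
b = matvec(A, x_true)
x = pcg_normal_equations(A, b)
print([round(xi, 8) for xi in x])
```

Each iteration is dominated by matrix-vector products and inner products, which are data-parallel operations.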

                     M1      M2      M3      M4      M5
Number of elements   1600    2500    3600    6400    10000
Number of nodes      6561    10201   14641   25921   40401
Total DOF            14803   23003   33003   58403   91003

Table 6.6: Meshes used in the performance analysis of the 2D lid-driven cavity problem considered in Sec. 6.8.3.

                     M1      M2      M3       M4       M5
Number of elements   1000    3375    8000     27000    64000
Number of nodes      9261    29791   68921    226981   531441
Total DOF            29114   93469   216024   710734   1663244

Table 6.7: Meshes used in the performance analysis of the 3D lid-driven cavity problem considered in Sec. 6.8.3.
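The counts in Tables 6.6 and 6.7 can be cross-checked with a short calculation. The sketch below assumes a uniform mesh with n elements per side, quadratic elements carrying one velocity unknown per component at every node, and one pressure unknown at each vertex node. This interpretation is an inference from the tabulated numbers, not something stated in this section, and the function name is hypothetical.

```python
def mesh_counts(n, dim):
    """Element, node and DOF counts for n elements per side in `dim` dimensions."""
    elements = n ** dim
    nodes = (2 * n + 1) ** dim       # quadratic elements: mid-side nodes included
    pressure_nodes = (n + 1) ** dim  # assumption: pressure at vertex nodes only
    dof = dim * nodes + pressure_nodes
    return elements, nodes, dof


# Elements per side inferred from the element counts in Tables 6.6 and 6.7.
cases_2d = {"M1": 40, "M2": 50, "M3": 60, "M4": 80, "M5": 100}
cases_3d = {"M1": 10, "M2": 15, "M3": 20, "M4": 30, "M5": 40}

for label, cases, dim in (("2D", cases_2d, 2), ("3D", cases_3d, 3)):
    for name, n in cases.items():
        elements, nodes, dof = mesh_counts(n, dim)
        print(f"{label} {name}: elements={elements}, nodes={nodes}, DOF={dof}")
```

Under these assumptions the formula reproduces every entry of both tables, for example 2 × 6561 + 41² = 14803 for the 2D mesh M1.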

M1 M2 M3 M4 M5

DATA 0.304 0.475 0.868 1.404 2.277

(0.021%) (0.016%) (0.016%) (0.010%) (0.007%)

MATX 2.037 2.925 4.086 7.330 11.241

(0.143%) (0.101%) (0.076%) (0.052%) (0.035%)

NORB 0.245 0.378 0.487 0.892 1.349

(0.017%) (0.013%) (0.009%) (0.006%) (0.004%)

PREC 0.407 0.609 0.814 1.499 2.355

(0.029%) (0.021%) (0.015%) (0.011%) (0.007%)

SOLV 1362.064 2845.961 5313.823 14162.614 32029.803

(99.781%) (99.846%) (99.882%) (99.922%) (99.946%)

Total 1365.057 2850.348 5320.078 14173.739 32047.025

Table 6.8: Timing in seconds of each routine executed on the CPU platform with different numbers of nodal points for the 2D lid-driven cavity flow problem investigated at Re = 1000. The values in parentheses denote the relative time.

M1 M2 M3 M4 M5

DATA 0.302 0.472 0.689 1.248 1.914

(0.007%) (0.006%) (0.005%) (0.004%) (0.003%)

MATX 2.813 3.108 4.410 8.213 12.421

(0.070%) (0.038%) (0.031%) (0.024%) (0.019%)

NORB 0.319 0.381 0.531 1.031 1.519

(0.008%) (0.005%) (0.004%) (0.003%) (0.002%)

PREC 0.508 0.657 0.914 1.541 2.623

(0.013%) (0.008%) (0.006%) (0.005%) (0.004%)

SOLV 4025.668 8245.954 14120.571 34015.611 66933.028

(99.902%) (99.944%) (99.954%) (99.965%) (99.972%)

Total 4029.610 8250.572 14127.115 34027.644 66951.505

Table 6.9: Timing in seconds of each routine executed on the CPU platform with different numbers of nodal points for the 2D lid-driven cavity flow problem investigated at Re = 5000. The values in parentheses denote the relative time.

M1 M2 M3 M4 M5

DATA 4.227 17.721 49.545 245.864 939.764

(0.179%) (0.184%) (0.167%) (0.158%) (0.162%)

MATX 12.027 36.909 94.396 554.422 1862.125

(0.508%) (0.384%) (0.317%) (0.355%) (0.321%)

NORB 0.452 1.310 3.104 17.117 69.612

(0.019%) (0.014%) (0.010%) (0.011%) (0.012%)

PREC 1.918 5.772 13.915 101.148 307.454

(0.081%) (0.060%) (0.047%) (0.065%) (0.053%)

SOLV 2347.659 9544.219 29570.941 154695.416 576922.149

(99.213%) (99.358%) (99.459%) (99.411%) (99.452%)

Total 2366.283 9605.931 29731.901 155611.971 580101.103

Table 6.10: Timing in seconds of each routine executed on the CPU platform with different numbers of nodal points for the 3D lid-driven cavity flow problem investigated at Re = 400. The values in parentheses denote the relative time.

M1 M2 M3 M4 M5

DATA 4.118 17.706 49.514 343.546 885.999

(0.049%) (0.066%) (0.062%) (0.063%) (0.061%)

MATX 11.996 36.551 82.742 577.879 1568.655

(0.142%) (0.136%) (0.104%) (0.106%) (0.108%)

NORB 0.374 1.029 2.745 27.258 87.148

(0.004%) (0.004%) (0.004%) (0.005%) (0.006%)

PREC 1.918 5.070 12.027 70.872 334.066

(0.023%) (0.019%) (0.015%) (0.013%) (0.023%)

SOLV 8403.321 26834.044 79235.325 544148.948 1449582.869

(99.781%) (99.776%) (99.815%) (99.813%) (99.802%)

Total 8421.727 26894.400 79382.353 545168.143 1452458.737

Table 6.11: Timing in seconds of each routine executed on the CPU platform with different numbers of nodal points for the 3D lid-driven cavity flow problem investigated at Re = 1000. The values in parentheses denote the relative time.

M1 M2 M3 M4 M5

DATA 0.321 0.505 0.716 1.298 1.989

ECOL 0.217 0.373 0.609 0.921 1.879

HTOD 0.246 0.385 0.533 0.929 1.424

MATX 2.037 3.509 4.541 8.128 13.039

NORB 0.054 0.074 0.093 0.161 0.239

PREC 0.797 1.628 1.809 3.158 4.966

SOLV 253.187 508.984 895.346 3254.107 5317.635

DTOH 0.005 0.008 0.008 0.015 0.022

Total 256.864 515.466 903.655 3268.717 5341.193

Table 6.12: Timing in seconds of each routine executed on the hybrid CPU/GPU platform with different numbers of nodal points for the 2D lid-driven cavity flow problem investigated at Re = 1000.

M1 M2 M3 M4 M5

DATA 0.333 0.569 0.873 1.334 2.004

ECOL 0.217 0.373 0.609 0.921 1.879

HTOD 0.266 0.420 0.581 1.014 1.593

MATX 2.318 3.441 5.151 9.173 14.544

NORB 0.051 0.082 0.101 0.169 0.254

PREC 0.904 1.352 2.041 3.528 5.540

SOLV 754.137 1717.714 2410.355 5553.297 10965.770

DTOH 0.002 0.004 0.011 0.019 0.031

Total 758.228 1723.955 2419.722 5569.455 10991.615

Table 6.13: Timing in seconds of each routine executed on the hybrid CPU/GPU platform with different numbers of nodal points for the 2D lid-driven cavity flow problem investigated at Re = 5000.

M1 M2 M3 M4 M5

DATA 11.882 31.178 92.289 550.847 2303.337

ECOL 1.232 13.587 76.736 837.616 4566.336

HTOD 1.729 5.620 7.703 43.606 70.603

MATX 22.585 75.138 167.243 625.483 1569.951

NORB 0.005 0.014 0.025 0.072 0.176

PREC 2.804 11.469 24.521 82.362 193.155

SOLV 176.580 700.007 2130.097 10348.068 33337.453

DTOH 0.002 0.007 0.023 0.070 0.110

Total 216.819 837.020 2498.637 11650.508 37474.785

Table 6.14: Timing in seconds of each routine executed on the hybrid CPU/GPU platform with different numbers of nodal points for the 3D lid-driven cavity flow problem investigated at Re = 400.

M1 M2 M3 M4 M5

DATA 11.984 30.875 90.088 536.819 2356.922

ECOL 1.232 13.587 76.736 837.616 4566.336

HTOD 1.713 2.787 11.554 36.621 65.632

MATX 23.374 61.842 151.328 694.178 1757.023

NORB 0.007 0.010 0.024 0.065 0.132

PREC 3.207 0.292 24.351 73.607 169.439

SOLV 642.055 1969.218 5723.879 33861.239 71861.279

DTOH 0.004 0.006 0.021 0.072 0.098

Total 683.576 2078.617 6077.981 36040.217 80776.861

Table 6.15: Timing in seconds of each routine executed on the hybrid CPU/GPU platform with different numbers of nodal points for the 3D lid-driven cavity flow problem investigated at Re = 1000.

M1 M2 M3 M4 M5

CPU 1365.057 2850.348 5320.078 14173.739 32047.025

CPU/GPU 256.864 515.466 903.655 2368.717 5341.193

Speedup 5.314 5.530 5.887 5.984 6.000

Table 6.16: Comparison of the total computation time (in seconds) and the speedup for the 2D lid-driven cavity flow problem investigated at Re = 1000.

M1 M2 M3 M4 M5

CPU 4029.610 8250.572 14127.115 34027.644 66933.028

CPU/GPU 758.228 1423.955 2419.722 5569.455 10965.770

Speedup 5.315 5.794 5.838 6.110 6.104

Table 6.17: Comparison of the total computation time (in seconds) and the speedup for the 2D lid-driven cavity flow problem investigated at Re = 5000.

M1 M2 M3 M4 M5

CPU 2366.283 9605.931 29731.901 155611.971 580101.103

CPU/GPU 216.819 837.020 2498.637 11650.508 37474.785

Speedup 10.914 11.476 11.899 13.356 15.479

Table 6.18: Comparison of the total computation time (in seconds) and the speedup for the 3D lid-driven cavity flow problem investigated at Re = 400.

M1 M2 M3 M4 M5

CPU 8421.727 26894.400 79382.353 545168.143 1452458.737

CPU/GPU 683.576 2078.617 6077.981 36040.217 80776.861

Speedup 12.320 12.939 13.061 15.216 17.980

Table 6.19: Comparison of the total computation time (in seconds) and the speedup for the 3D lid-driven cavity flow problem investigated at Re = 1000.

Figure 6.1: Comparison of the peak performance of the CPUs and the GPUs

Figure 6.2: Comparison of the memory bandwidth of the CPUs and the GPUs

Figure 6.3: Basic structure of the two investigated typical processors [94]: (a) CPU ; (b) GPU

Figure 6.4: Basic structure of the Kepler K20 architecture.

Figure 6.5: Schematic of the CUDA programming model.

Figure 6.6: Illustration of the 2D mesh coloring strategy, in which the elements are divided into four color groups (labeled A to D) such that no two elements in a group share a global node.

Figure 6.7: Illustration of the 3D mesh coloring strategy, in which the elements are divided into eight color groups (labeled A to H) such that no two elements in a group share a global node.

Figure 6.8: Distribution of all element matrices in global memory to satisfy the global memory coalescing.

Figure 6.9: Extension of the element matrix to satisfy the global memory coalescing condition.

Figure 6.10: Illustration of the inner product operation on GPU

Figure 6.11: Flow chart for the developed code executed on a hybrid CPU/GPU platform.

Figure 6.12: Comparison of the predicted mid-sectional velocity profiles for the lid-driven cavity problem considered in Sec. 6.8.3. (a)-(b) 2D problems; (c)-(d) 3D problems. The legend includes the reference data of Ghia et al. [52] (256×256 grid).

Figure 6.13: Plot of the speedup ratio against grid size for the two- and three-dimensional lid-driven cavity problems (horizontal axis: grid size; vertical axis: speedup ratio).

Chapter 7 Applications

In the previous chapters, we focused on developing a new finite element model for the simulation of incompressible fluid flow and heat transfer problems. The use of the proposed finite element model has been justified through several analytical verification and benchmark validation problems. In this chapter, the proposed model is applied to investigate some practical flow problems. To obtain the results quickly, all calculations are executed on the hybrid CPU/GPU platform.

7.1 Three-dimensional 90° bend curved duct flow problem

Flow in curved ducts can be found in piping systems, air conditioning, centrifugal pumps, aircraft intakes, river bends and the cooling coils of heat exchangers. Its practical significance has motivated considerable research effort in the past. In a curved geometry, the destabilizing centrifugal and viscous (Tollmien-Schlichting) instabilities can both be present and may furthermore interact with each other. Because of the nonlinear interaction between these two instability mechanisms, it is highly possible that fluid flows in bent ducts proceed prematurely from transition to turbulence. Advancing our understanding of the three-dimensional nature of curved duct flow is thus of fundamental importance.

A main difference between curved and straight duct flows is the generation of secondary flow within the elbow region. It was first observed by Eustice in curved pipes [107,108]. Later on, the experimentally observed phenomena were analytically confirmed by Dean [109]. The secondary flow motion makes the flow in curved ducts differ significantly from that seen in straight ducts. Secondary flow phenomena depend upon the Dean number, which represents the ratio of the centrifugal force to the viscous force. Secondary flow results in a pressure loss, a spatial redistribution of the streamwise velocity and increased heat transfer in the duct. In addition, a much larger pressure drop, better mixing, and non-uniform wall shear stress are major features of the fluid flow in curved ducts.

In the past decades, numerous studies of curved ducts with rectangular or circular cross sections have been carried out. Owing to the effects of geometry, the flow in rectangular curved ducts is more involved than that in circular ducts. However, some useful insights are usually impossible to obtain in experiments. One way to gain insight into the secondary flow is to exploit computational fluid dynamics techniques, and this problem is particularly well suited to numerical simulation.
