
This strategy assigns one thread to each row-block in GSM. The number of working threads therefore equals the number of row-blocks, which is much larger than the number of elements in LSM. In this strategy, each thread scans through all elements in the corresponding row of LSM. Many threads may simply quit because there are no corresponding elements in that row of LSM. Threads are organized one-dimensionally into thread blocks, and the thread blocks are also arranged one-dimensionally on the grid. This is illustrated in Figure 3, and a code sketch is given below.
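As a rough illustration, the following CUDA sketch shows how strategy 2 might map one thread to one row-block. All structure and variable names here (RowBlock, dofMap, and so on) are hypothetical assumptions, since this section does not spell out the actual CRABS data layout; the sketch only captures the one-thread-per-row-block mapping and the early exit described above.

```cuda
// Hypothetical CRABS-style row-block record; the actual data layout
// used in the paper is not specified in this section.
struct RowBlock {
    int row;        // global row this block belongs to
    int firstCol;   // first global column covered by the block
    int nCols;      // number of columns stored in the block
    double* vals;   // the block's entries
};

// Strategy 2: one thread per row-block of GSM (1D blocks on a 1D grid).
__global__ void assembleStrategy2(RowBlock* blocks, int nBlocks,
                                  const double* lsm,  // dense LSM (lsmDim x lsmDim)
                                  const int* dofMap,  // local DOF -> global DOF
                                  int lsmDim)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= nBlocks) return;
    RowBlock rb = blocks[b];

    // Find the LSM row (if any) that maps to this block's global row.
    int localRow = -1;
    for (int i = 0; i < lsmDim; ++i)
        if (dofMap[i] == rb.row) { localRow = i; break; }
    if (localRow < 0) return;  // many threads quit here, as noted above

    // Scan every element of that LSM row and accumulate those that
    // fall inside this row-block's column range.
    for (int j = 0; j < lsmDim; ++j) {
        int gcol = dofMap[j];
        if (gcol >= rb.firstCol && gcol < rb.firstCol + rb.nCols)
            rb.vals[gcol - rb.firstCol] += lsm[localRow * lsmDim + j];
    }
}
```

With this mapping, a launch such as assembleStrategy2<<<(nBlocks + 255) / 256, 256>>>(dBlocks, nBlocks, dLsm, dDofMap, lsmDim) covers all row-blocks with 1D thread blocks arranged on a 1D grid.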

Figure 3. Illustration of the idea of strategy 2 for CUDA CRABS.

3.3 Strategy three

Strategy three improves upon strategy two by using a different thread configuration: threads are organized into three-dimensional thread blocks, as illustrated in Figure 4. Three-dimensional thread blocks help expose more parallelism. In the first dimension of a thread block, different threads work on different row-blocks within a row; in the second dimension, different threads work on different rows of GSM; and in the third dimension, threads work on different elements within a row-block. The three-dimensional thread blocks are then laid out on a two-dimensional grid to cover all row-blocks in GSM, as in the launch sketch below.
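A minimal host-side sketch of this launch configuration follows. The block dimensions (4 × 8 × 8) and the counts nRows and maxBlocksPerRow are illustrative assumptions, not values from the paper; only the meaning of the three block dimensions and the use of a 2D grid follow the description above.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: the real argument list and body depend on the
// CRABS data layout, which this section does not spell out.
__global__ void assembleStrategy3()
{
    // threadIdx.x -> which row-block within the current row
    // threadIdx.y -> which row of the GSM (combined with blockIdx.y)
    // threadIdx.z -> which element within the row-block
    // Accumulation body omitted.
}

// Lay 3D thread blocks on a 2D grid so that every row-block is covered.
void launchStrategy3(int nRows, int maxBlocksPerRow)
{
    dim3 block(4,   // x: row-blocks within one row
               8,   // y: rows of the GSM
               8);  // z: elements within a row-block
    dim3 grid((maxBlocksPerRow + block.x - 1) / block.x,
              (nRows + block.y - 1) / block.y);
    assembleStrategy3<<<grid, block>>>();
}
```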

Figure 4. Illustration of the 3D thread block used in strategy 3.

4 Effectiveness evaluation of different strategies

The effectiveness of the developed strategies was evaluated using the computing environment summarized in Table 2. Before the evaluations, the computer codes were first validated by computing benchmark problems in 2D and 3D: a plane-stress cantilever beam problem and a point-load problem. These problems were chosen because they have analytical solutions against which the computed results can be checked. The computed solutions all agreed very well with the analytical solutions.

Once the computer codes were validated, the evaluations were conducted using benchmark problems of various sizes, in order to determine the effect of problem size on the achieved speedups or slowdowns. Pure CPU computations were also performed so that speedups could be computed. In the following, the speedup is defined as the total computation time of the pure CPU computation divided by the total computation time of the hybrid computation. The following sections present the benchmark problems and the achieved speedups.
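In symbols, with T_CPU and T_hybrid denoting these two total computation times (notation introduced here for clarity):

    speedup = T_CPU / T_hybrid

A value greater than 1 thus indicates an acceleration, while a value below 1 indicates a slowdown.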

Table 2. Computational environment.

CPU: Intel Core i7 950 @ 3.07 GHz
Accelerator: NVIDIA Tesla C2050
Memory: 24 GB
Operating system: Gentoo Linux 64-bit
Compiler: Intel C/C++ Compiler 11.1
Accelerator SDK: CUDA SDK 3.2
Solver: Intel MKL, PARDISO 10.3.6

4.1 Evaluations in 2D

Figure 5 shows the 2D plane-stress cantilever beam benchmark problem. The beam is subjected to a concentrated load P = 1000 N at the free end. The depth of the beam is D = 12 m and the length is L = 48 m. The material of the beam is assumed to be linearly elastic, with Young's modulus E = 3×10⁷ N/m² and Poisson's ratio ν = 0.3. Nodes and integration cells are uniformly distributed within the problem domain, with the three selected cases listed in Table 3. Each integration cell has 25 quadrature points (5×5).

The various problem sizes evaluated are given in Table 3, and the speedups of the different strategies are shown in Figure 6. Figure 6(a) shows results for CRABS with E = 0. In this setting, matrix elements can be accessed quickly, but more memory is required for storing GSM. Strategies 1 and 3 give almost identical speedups of 1.25–1.35, while strategy 2 has a speedup of 0.1, meaning the hybrid computation is ten times slower than the pure CPU computation.

Figure 6(b) shows results for CRABS with E = 1.0. In this setting, row-blocks must be scanned in order to add a non-zero element into GSM, making GSM assemblage slow. As seen in Figure 6(b), hybrid computing speeds up this process and achieves a speedup of 1.8–2.2. Here strategy 3 performs better than strategy 1, while strategy 2 still slows down the GSM assemblage.

Figure 5. Cantilever beam model.

Table 3. Three selected cases of nodes and integration cells in the 2D model.

Case  Nodes (X)  Nodes (Y)  DOFs   Cells (X)  Cells (Y)  Quadrature points
1     101        29         5858   40         16         3200
2     201        57         22914  80         32         12800
3     301        85         51170  160        48         28800

Figure 6. The speedup of CUDA CRABS in 2D cases.


4.2 Evaluations in 3D

Figure 7 shows a 3D point load on an elastic half-space. Roller boundary conditions are applied on all sides except the top surface. The elastic medium has a Young's modulus of 3.0×10⁷ N/m² and a Poisson's ratio of 0.3. The analysis domain is 1 m × 1 m × 1 m, and the point load, with a magnitude of 1 N, is applied at a corner of the cubic domain. Nodes and integration cells are uniformly distributed within the problem domain, with the three selected cases listed in Table 4. Each integration cell has 125 quadrature points (5×5×5).

Figure 8(a) shows speedups using CRABS with E = 0.0. Similar to the 2D results, strategies 1 and 3 give nearly identical speedups, ranging between 3.0 and 3.5, while strategy 2 slows down the analyses.

Compared to the 2D results, however, the 3D analyses achieve higher speedups. This is because the GSM in 3D tends to be larger and denser than in 2D, so there are more non-zero elements to be added to GSM in 3D analyses than in 2D analyses.

Figure 8(b) shows speedups using CRABS with E = 1.0. Again, strategies 1 and 3 give favourable speedups. As seen in Figure 8(b), strategy 1 outperforms strategy 3 in this setting, achieving a significant speedup of 12.0; strategy 3 also achieves a speedup of 8 for the largest case. It is also seen in Figure 8 that the speedup increases with increasing problem size.

Figure 7. Point load on an elastic half-space model.

Table 4. Three selected cases of nodes and integration cells in the 3D model.

Case  Nodes (X)  Nodes (Y)  Nodes (Z)  DOFs  Cells (X)  Cells (Y)  Cells (Z)  Quadrature points
1     7          7          7          1029  7          7          7          1715
2     9          9          9          2187  9          9          9          3645
3     10         10         10         3000  10         10         10         5000


Figure 8. The speedup of CUDA CRABS in 3D cases.

5 Conclusions

In this work, GSM assemblage was successfully accelerated using a hybrid computing technique. Focusing on GSM assemblage requires only small modifications to existing code. Three strategies were devised for using the hybrid computing technique to speed up the assemblage of GSM. Overall, strategy 1 performed best, achieving a speedup of 12.0 for large 3D meshfree analyses. The speedup increases with increasing problem size, so further speedups may be achievable for even larger problems.

Acknowledgements

This study was supported by a grant from the National Science Council of Taiwan (NSC 98-2221-E-011-112-MY2).

