
Department of Construction Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, Republic of China

solving the linear system of equations takes less time than assembling the GSM. Based on these observations, we focus on accelerating GSM assemblage with an advanced data structure and the hybrid computing technique.

Hybrid computing (or heterogeneous computing) refers to the use of more than one type of computational unit to perform computing tasks. It is a special form of parallel computing and is rather challenging to use. It is challenging partly because one needs a good understanding of the strengths and weaknesses of each type of computational unit in order to make the best use of them, and partly because programming models different from the conventional ones must be used. With proper collaboration between these units, significant speedups can be achieved. On the other hand, the communication between these units can sometimes outweigh the benefits, making the overall computation slower. In this paper, we share our experience in using such techniques to speed up meshfree analyses.

The remaining parts of this paper are organized as follows. We first briefly introduce our data structure for storing sparse matrices. Then, three strategies for accelerating GSM assemblage are described. Their effectiveness is then evaluated and demonstrated with two benchmark problems of various sizes. Finally, we summarize our findings and experiences.

2 The sparse matrix data structure

The data structure used for storing and assembling the GSM in this work is our home-grown data structure called CRABS, which stands for Compressed Row Adaptive Block Storage. In CRABS, each row of a sparse matrix is stored using one or more blocks, called row-blocks. Each row-block is essentially a one-dimensional array storing contiguous matrix elements, and matrix elements of value zero are allowed to be stored in row-blocks. The row-block is illustrated in Figure 1. Each row of a sparse matrix may be formed by a different row-block pattern; CRABS therefore stores a sparse matrix as a collection of row-blocks. In this paper, we give only a conceptual description of CRABS; the detailed data structure is omitted for space reasons.
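As a rough illustration only, the following C-style sketch shows one possible layout of the row-block structure described above; the type and member names (RowBlock, firstColumn, values, and so on) are our own hypothetical choices, not the actual implementation.

/* Hypothetical sketch of the CRABS row-block layout (names are illustrative). */
typedef struct {
    int     firstColumn;  /* column index of the first element stored in this block */
    int     length;       /* number of consecutive matrix elements held by the block */
    double *values;       /* contiguous storage; may contain explicit zeros          */
} RowBlock;

typedef struct {
    int       numBlocks;  /* number of row-blocks forming this matrix row */
    RowBlock *blocks;     /* row-blocks sorted by firstColumn             */
} CrabsRow;

typedef struct {
    int       numRows;
    CrabsRow *rows;       /* one entry per matrix row */
} CrabsMatrix;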

The row-block structure for storing a sparse matrix needs to be formed beforehand. This is necessary to avoid the overheads associated with memory allocations. Nodal support domain information is used to derive the non-zero pattern of the GSM, and this non-zero pattern is then scanned to determine the row-block structure. The scanning process uses a parameter E, in the range 0.0 to 1.0, to control the memory efficiency of the created structure. Its effect is discussed in the next two paragraphs.
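Purely for illustration (this is not our actual scanning algorithm, whose details are not given here), one plausible way to derive row-blocks from a row's non-zero pattern is to start a new block whenever the run of zeros between consecutive non-zeros exceeds a threshold controlled by E. The parameter maxGap below, the largest gap a block may absorb when E = 0, is likewise a hypothetical choice.

/* Illustrative only: split a row's sorted non-zero column indices into
   row-blocks. Larger E tolerates fewer stored zeros, producing more, smaller
   blocks; E = 0 yields a single block per row. The threshold rule below is
   an assumption, not the actual CRABS scanning algorithm. */
int build_row_blocks(const int *nzCols, int numNz, double E,
                     int maxGap, RowBlock *blocks)
{
    int numBlocks  = 0;
    int allowedGap = (int)((1.0 - E) * maxGap);  /* zeros a block may absorb */
    int start      = 0;
    for (int k = 1; k <= numNz; ++k) {
        int gap = (k < numNz) ? nzCols[k] - nzCols[k - 1] - 1 : allowedGap + 1;
        if (gap > allowedGap) {                  /* close the current block  */
            blocks[numBlocks].firstColumn = nzCols[start];
            blocks[numBlocks].length      = nzCols[k - 1] - nzCols[start] + 1;
            ++numBlocks;
            start = k;
        }
    }
    return numBlocks;
}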

When E equals 0.0, each row has exactly one row-block that stores all non-zero elements of that row. As a result, the row-block will contain some zero elements unless all non-zero elements are adjacent to each other. In this extreme, CRABS behaves like the skyline sparse matrix storage (Jennings, 1996). It is very easy to find the proper location to store an element: all it takes is simple arithmetic (a subtraction and an addition, to be exact). Therefore, this extreme allows fast matrix element access, but the fast access comes at the cost of large memory consumption.
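To make the arithmetic concrete, a minimal sketch of element access in this extreme, assuming the hypothetical structure sketched above, could look as follows:

/* E = 0: each row has exactly one row-block, so locating an element only
   needs a subtraction (offset within the block) and an addition (indexing).
   The requested column is assumed to lie within the stored span. */
double* crabs_access_e0(CrabsMatrix *A, int row, int col)
{
    RowBlock *block  = &A->rows[row].blocks[0]; /* the only block in this row */
    int       offset = col - block->firstColumn; /* subtraction */
    return &block->values[offset];               /* addition (indexing) */
}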

When E equals 1.0, the scanning algorithm we developed allows no zero-valued matrix elements to be stored. As a result, CRABS achieves its most compact storage for sparse matrices, but each row is likely to contain many row-blocks. We can show that, in general, the data structure then requires about two thirds of the memory space needed to store the identical matrix in the CRS format. This compact storage, however, comes at the cost of element access speed. At this extreme, the row-blocks of a row need to be scanned one by one to identify which row-block hosts the element to be accessed, and there can be many row-blocks. As a result, element access at this extreme is slow, although the cost is still lower than that of CRS.
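For comparison, a sketch of element access at this extreme, again using the hypothetical structure above, must scan the row-blocks of the row one by one:

/* E = 1: a row may own many row-blocks, so the hosting block must be found
   by scanning; access is slower than in the E = 0 extreme. */
double* crabs_access_e1(CrabsMatrix *A, int row, int col)
{
    CrabsRow *r = &A->rows[row];
    for (int b = 0; b < r->numBlocks; ++b) {
        RowBlock *block  = &r->blocks[b];
        int       offset = col - block->firstColumn;
        if (offset >= 0 && offset < block->length)
            return &block->values[offset];  /* hosting block found */
    }
    return NULL;                            /* structurally zero element */
}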

In summary, we use CRABS for storing the GSM because of its flexibility. The parameter E in CRABS helps us control the trade-off between memory consumption and element access speed. As a result, we need only one sparse matrix data structure, and we can tune our code either to use more memory for efficiency or to sacrifice performance in order to solve large problems.

Figure 1. Illustration of row-blocks in CRABS: (a) an assumed row vector of a sparse matrix; (b) CRABS with E = 0; (c) CRABS with E = 1.

3 Strategies to speed up GSM assemblage using the hybrid computing technique

In this work, an Intel CPU and an NVidia accelerator (Tesla C2050) are used simultaneously to perform meshfree analyses. The Intel CPU is a good general-purpose processor and is responsible for most tasks in a meshfree analysis, including finding neighbouring nodes, computing shape functions, performing Gauss integration, solving the linear system of equations, and so on. The Tesla C2050 is solely responsible for assembling the GSM using CRABS. This choice was made because we want to utilize the high memory bandwidth available on the Tesla C2050 device. This high bandwidth can only be achieved with proper memory access patterns and many threads, a requirement that limits which parts of a meshfree analysis can be ported onto the device. During GSM assemblage, there is a high demand for memory bandwidth in order to find the proper location to store each element of an LSM.

Each element of an LSM is independent of the others, and therefore this task is highly parallel. For an LSM of size 16x16, there are 256 elements whose target locations can be searched for simultaneously and stored into the GSM.

Typical meshfree methods have LSMs much larger than 16x16, so it is suitable to offload GSM assemblage from the CPU onto the Tesla C2050. To make hybrid computing efficient, the CPU and the Tesla C2050 must do computational work at the same time, and we achieve this during GSM assemblage. The CPU is used to compute each LSM, which is a small and dense matrix. Once an LSM is calculated, it is transferred asynchronously to the Tesla C2050 while the CPU computes the next LSM. Therefore, during GSM assemblage, both the CPU and the Tesla C2050 are kept busy: one computes LSMs while the other adds them into the GSM. This also makes the relatively small CPU cache more effective, because it no longer needs to hold GSM data. This design, as shown later in the paper, works well and achieves a good speedup over our code implemented fully on the CPU.
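The overlap described above can be realized with standard CUDA asynchronous copies and streams. The following sketch illustrates the pattern only; the helper compute_lsm(), the kernel assemble_lsm_kernel(), and the launch configuration are hypothetical placeholders rather than our actual code.

// Sketch: overlap CPU-side LSM computation with device-side GSM assembly
// using double-buffered pinned host memory, one CUDA stream, and
// asynchronous host-to-device copies.
extern void compute_lsm(double *lsm, int cell);              // CPU-side LSM evaluation
__global__ void assemble_lsm_kernel(double *d_lsm, int cell); // device-side scatter into GSM

void assemble_gsm_hybrid(int numCells, size_t lsmBytes, dim3 grid, dim3 block)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    double *h_lsm[2], *d_lsm;                       // double-buffered host LSMs
    cudaHostAlloc((void**)&h_lsm[0], lsmBytes, cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_lsm[1], lsmBytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d_lsm, lsmBytes);

    compute_lsm(h_lsm[0], 0);                       // CPU computes the first LSM
    for (int i = 0; i < numCells; ++i) {
        int cur = i & 1;
        // Ship the current LSM to the device and start assembling it there.
        cudaMemcpyAsync(d_lsm, h_lsm[cur], lsmBytes,
                        cudaMemcpyHostToDevice, stream);
        assemble_lsm_kernel<<<grid, block, 0, stream>>>(d_lsm, i);
        // Meanwhile the CPU evaluates the next LSM into the other buffer.
        if (i + 1 < numCells)
            compute_lsm(h_lsm[1 - cur], i + 1);
        cudaStreamSynchronize(stream);              // the current buffer may be reused
    }

    cudaStreamDestroy(stream);
    cudaFree(d_lsm);
    cudaFreeHost(h_lsm[0]);
    cudaFreeHost(h_lsm[1]);
}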

Three strategies were devised to carry out this hybrid computation on the Tesla C2050, hereafter simply called the device. Using the device requires a programming toolkit, and there are two choices: NVidia CUDA and OpenCL. NVidia CUDA has reached maturity and mainstream adoption, while OpenCL is relatively young; we therefore used NVidia CUDA in our implementation. Although OpenCL and CUDA are two different standards, their concepts are rather similar. The three strategies all use the device for assembling the GSM.


The main differences between the strategies are: 1) how the device threads are organized into thread blocks, and 2) how the thread blocks are launched. Their tasks on the device are quite similar: the device is responsible for adding each LSM element to the appropriate location in the GSM, which involves searching for the right row-block to update. The following subsections introduce the three strategies.

3.1 Strategy one

Strategy one parallelizes the task by assigning one thread to each element of the LSM. Each thread needs to 1) identify the position of its element in the GSM, 2) scan through the row-blocks of the corresponding GSM row, and 3) store the element into the correct row-block. The threads are organized into one-dimensional thread blocks, and the thread blocks are then arranged on a two-dimensional grid whose size is determined by the size of the LSM. Figure 2 illustrates an LSM of size 10x10; assuming each thread block has four threads, the number of thread blocks is 3x10.

Figure 2. Illustration of the idea of strategy one for CUDA CRABS.
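A minimal CUDA sketch of this strategy is given below; the kernel signature, the flattened device-side CRABS view (DeviceCrabs), and the DOF mapping array globalDof are our own illustrative assumptions rather than the exact implementation. Within a single LSM, each element maps to a distinct GSM entry, so no atomic update is needed as long as LSMs are assembled one at a time.

// Illustrative flattened device-side view of CRABS (hypothetical layout).
struct DeviceCrabs {
    int    *rowBlockStart;   // per-row prefix offsets into the block arrays
    int    *blockFirstCol;   // first column covered by each row-block
    int    *blockLength;     // number of elements held by each row-block
    int    *blockValueStart; // offset of each row-block's data in 'values'
    double *values;          // all row-block data, concatenated
};

// Strategy one (sketch): one device thread per LSM element. Threads form
// one-dimensional blocks; the blocks form a two-dimensional grid
// (blockIdx.y selects the LSM row, blockIdx.x covers the LSM columns).
__global__ void assemble_lsm_kernel(DeviceCrabs gsm, const double *lsm,
                                    const int *globalDof, int lsmDim)
{
    int i = blockIdx.y;                             // LSM row of this thread
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // LSM column of this thread
    if (i >= lsmDim || j >= lsmDim) return;

    int    row = globalDof[i];                      // global row in the GSM
    int    col = globalDof[j];                      // global column in the GSM
    double val = lsm[i * lsmDim + j];

    // Scan the row-blocks of this GSM row to find the hosting block.
    for (int b = gsm.rowBlockStart[row]; b < gsm.rowBlockStart[row + 1]; ++b) {
        int offset = col - gsm.blockFirstCol[b];
        if (offset >= 0 && offset < gsm.blockLength[b]) {
            gsm.values[gsm.blockValueStart[b] + offset] += val;
            break;
        }
    }
}

For the 10x10 LSM of Figure 2 with four threads per block, the launch configuration in this sketch would be dim3 block(4) and dim3 grid(3, 10).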

3.2 Strategy two

In strategy one, every thread needs to scan through multiple row-blocks, and only a limited amount of parallelism is exposed: the maximum number of working threads equals the number of elements in
