Triangle type determination - Use of Data on GPUs

Implementation on GPUs

5.3.2 Triangle type determination

int THREAD = N;                // threads per block
int BLOCK = nt / THREAD + 1;   // nt: number of triangles
gpu_faceTypeDetermination<<<BLOCK, THREAD, 0>>>(...);

The type of every triangle is determined from its orientation relative to the view primitive in [0, ∆t]. The GPU procedure is the same as the CPU procedure, except that one thread is created per triangle. In total, nt threads are created, N per block, where nt is the number of triangles.
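The one-thread-per-triangle launch can be illustrated with a minimal CUDA sketch. The kernel name `faceTypeSketch`, the `FaceType` labels, and the `facing` array (a precomputed signed orientation value per triangle) are hypothetical stand-ins, not the actual view test. The sketch only shows the index computation and the bounds guard that the launch configuration requires, since nt / N + 1 blocks always contain at least one surplus thread.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical type labels; the real view sets are defined elsewhere in the thesis.
enum FaceType : int { FACING_AWAY = 0, FACING_TOWARD = 1, UNDETERMINED = 2 };

// One thread per triangle. facing[i] stands in for a signed orientation
// measure of triangle i relative to the view primitive (an assumption).
__global__ void faceTypeSketch(const float* facing, int* type, int nt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nt) return;  // surplus threads in the last block do nothing
    float f = facing[i];
    type[i] = (f > 0.0f) ? FACING_TOWARD : (f < 0.0f) ? FACING_AWAY : UNDETERMINED;
}

// Host wrapper: N threads per block and nt / N + 1 blocks, as in the text.
std::vector<int> determineTypes(const std::vector<float>& facing, int N)
{
    assert(N > 0);
    int nt = static_cast<int>(facing.size());
    std::vector<int> type(nt);
    if (nt == 0) return type;

    float* dFacing = nullptr;
    int* dType = nullptr;
    cudaMalloc(&dFacing, nt * sizeof(float));
    cudaMalloc(&dType, nt * sizeof(int));
    cudaMemcpy(dFacing, facing.data(), nt * sizeof(float), cudaMemcpyHostToDevice);

    int THREAD = N;
    int BLOCK = nt / THREAD + 1;
    faceTypeSketch<<<BLOCK, THREAD, 0>>>(dFacing, dType, nt);

    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(type.data(), dType, nt * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dFacing);
    cudaFree(dType);
    return type;
}
```

With nt = 5 and N = 4, two blocks (8 threads) are launched and the guard makes the last three threads return immediately.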

For closed meshes, we divide all triangles into three view sets, and the types of all triangles are computed every frame. For unclosed meshes, we divide all triangles into four view sets. In a given frame, view tests are performed only for non-violated triangles to determine their types; violated triangles are skipped. We keep the types of violated triangles until the model is self-collision free.

On the other hand, we do not employ triangle clusters on GPUs. The main benefit of triangle clusters is a shorter execution time for vertex region determination. On CPUs, vertex region determination is performed sequentially, vertex by vertex, so its execution time is proportional to the number of vertices. On GPUs, vertex region determination is performed in parallel for all vertices, so the number of vertices has much less impact on its performance. We therefore do not adopt triangle clusters.

5.4 Boundary Handling

For unclosed meshes, we need to perform boundary handling. We extract the triangles that collide with or pass through ghost triangles and record these triangles until the object is self-collision free.

int THREAD = N;                // threads per block
int BLOCK = nt / THREAD + 1;   // nt: number of triangles
gpu_boundaryHandling<<<BLOCK, THREAD, 0>>>(...);

Violated triangles keep their types until the object is self-collision free, so we do not perform boundary handling for them. There are two policies. The simplest is to create one thread per triangle: in total, nt threads are created, N per block, where nt is the number of triangles. Every thread runs the boundary handling kernel, but threads responsible for violated triangles do not execute any computation. In other words, the process only needs to perform boundary handling and update the types for non-violated triangles.
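A minimal sketch of this first policy follows. The kernel name `boundaryHandlingAll` and the `touchesGhost` array are hypothetical: `touchesGhost[i]` stands in for the outcome of testing triangle i against the ghost triangles, which is not reproduced here. The point of the sketch is the early return that leaves threads assigned to violated triangles idle.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// First policy: one thread per triangle. touchesGhost[i] is a hypothetical
// stand-in for the result of the ghost-triangle test of triangle i.
__global__ void boundaryHandlingAll(unsigned char* violated,
                                    const unsigned char* touchesGhost, int nt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nt) return;                   // surplus thread in the last block
    if (violated[i]) return;               // violated triangle: keep state, do no work
    if (touchesGhost[i]) violated[i] = 1;  // record a newly violated triangle
}

// Host wrapper: returns the updated violated flags.
std::vector<unsigned char> runBoundaryHandlingAll(
    std::vector<unsigned char> violated,
    const std::vector<unsigned char>& touchesGhost, int N)
{
    assert(N > 0 && violated.size() == touchesGhost.size());
    int nt = static_cast<int>(violated.size());
    if (nt == 0) return violated;

    unsigned char* dViolated = nullptr;
    unsigned char* dTouches = nullptr;
    cudaMalloc(&dViolated, nt);
    cudaMalloc(&dTouches, nt);
    cudaMemcpy(dViolated, violated.data(), nt, cudaMemcpyHostToDevice);
    cudaMemcpy(dTouches, touchesGhost.data(), nt, cudaMemcpyHostToDevice);

    int THREAD = N;
    int BLOCK = nt / THREAD + 1;
    boundaryHandlingAll<<<BLOCK, THREAD, 0>>>(dViolated, dTouches, nt);

    cudaMemcpy(violated.data(), dViolated, nt, cudaMemcpyDeviceToHost);
    cudaFree(dViolated);
    cudaFree(dTouches);
    return violated;
}
```

Threads that return early still occupy a slot in their warp, which is why the workload is uneven when violated and non-violated triangles are interleaved.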

int THREAD = N;                // threads per block
int BLOCK = nv / THREAD + 1;   // nv: number of non-violated triangles
gpu_boundaryHandling<<<BLOCK, THREAD, 0>>>(...);

In the second policy, we compress the triangle type list and collect the indices of non-violated triangles before performing boundary handling. This gives a compact array that records the indices of non-violated triangles. One thread is created per non-violated triangle: in total, nv threads are created, N per block, where nv is the number of non-violated triangles. Compressing the triangle type list reduces the number of threads and balances the workload across threads, but the compression itself has a cost.

The compression proceeds as follows. First, we divide the triangle type list into nc chunks of 8 elements each and create nc threads. Second, each thread counts the non-violated triangles in its chunk and stores the result in an array A; note that A has nc elements. Third, we compute the exclusive prefix sum of A and store it in another array B. Each element of B is the start index in the compact array for the corresponding chunk. Finally, we store the indices of non-violated triangles in the compact array at the positions given by B, as shown in Figure 5.3. The elements of the compact array are the indices of the non-violated triangles.
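The compression steps can be sketched in CUDA as two kernels around a prefix sum. The names `countChunk`, `scatterChunk`, and `compactNonViolated`, and the use of a byte flag per triangle in place of the full type list, are assumptions made for illustration. The source does not say how the prefix sum is computed; the sketch uses Thrust's `exclusive_scan` as a convenient substitute.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

constexpr int CHUNK = 8;  // elements per chunk, as in the text

// Thread c counts the non-violated triangles of chunk c into A[c].
__global__ void countChunk(const unsigned char* violated, int* A, int nt, int nc)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nc) return;
    int end = min((c + 1) * CHUNK, nt);  // the last chunk may be partial
    int count = 0;
    for (int i = c * CHUNK; i < end; ++i)
        if (!violated[i]) ++count;
    A[c] = count;
}

// Thread c writes the indices of its non-violated triangles starting at B[c].
__global__ void scatterChunk(const unsigned char* violated, const int* B,
                             int* compact, int nt, int nc)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= nc) return;
    int end = min((c + 1) * CHUNK, nt);
    int out = B[c];
    for (int i = c * CHUNK; i < end; ++i)
        if (!violated[i]) compact[out++] = i;
}

// Host wrapper: returns the compact array of non-violated triangle indices.
std::vector<int> compactNonViolated(const std::vector<unsigned char>& violated, int N)
{
    assert(N > 0);
    int nt = static_cast<int>(violated.size());
    if (nt == 0) return {};
    int nc = (nt + CHUNK - 1) / CHUNK;

    unsigned char* dV = nullptr;
    int *dA = nullptr, *dB = nullptr, *dCompact = nullptr;
    cudaMalloc(&dV, nt);
    cudaMalloc(&dA, nc * sizeof(int));
    cudaMalloc(&dB, nc * sizeof(int));
    cudaMalloc(&dCompact, nt * sizeof(int));
    cudaMemcpy(dV, violated.data(), nt, cudaMemcpyHostToDevice);

    int BLOCK = nc / N + 1;
    countChunk<<<BLOCK, N>>>(dV, dA, nt, nc);
    thrust::exclusive_scan(thrust::device, dA, dA + nc, dB);  // B = exclusive prefix sum of A
    scatterChunk<<<BLOCK, N>>>(dV, dB, dCompact, nt, nc);

    // nv = start index of the last chunk + its count.
    int lastA = 0, lastB = 0;
    cudaMemcpy(&lastA, dA + nc - 1, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&lastB, dB + nc - 1, sizeof(int), cudaMemcpyDeviceToHost);
    int nv = lastA + lastB;

    std::vector<int> compact(nv);
    if (nv > 0)
        cudaMemcpy(compact.data(), dCompact, nv * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dV); cudaFree(dA); cudaFree(dB); cudaFree(dCompact);
    return compact;
}
```

A boundary handling kernel under the second policy would then launch nv threads and let thread j process triangle `compact[j]`, so no thread is idle apart from the surplus ones in the last block.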

Figure 5.3: Preprocess the triangle type list before performing boundary handling.

In the boundary handling step, each thread traverses the BVHs of ghost triangles with its target triangle. For example, if there are nh holes in the deformable object, the target triangle is tested against these nh BVHs of ghost triangles sequentially. The process ends as soon as the target triangle collides with any ghost triangle. In fact, we determine whether the target triangle collides with a ghost triangle from their bounding boxes rather than from the triangles themselves. Exact collision detection by triangles performs poorly on GPUs, for two reasons. First, GPUs favor simple and uniform computation within a kernel. Traversal with exact collision detection mixes overlap tests for bounding volumes with elementary tests between triangles; it is more complicated, and the degree of parallelism is reduced. Second, it causes many global memory accesses: bounding volume data and triangle data must be accessed at the same time, so cached data is reused poorly. Inexact collision detection by bounding volumes marks more triangles as violated than exact collision detection by triangles. Consequently, in the later traversal step, the computation and the number of potentially colliding pairs are larger. Nevertheless, the overall performance is better.
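The bounding-box test that replaces the exact triangle test can be sketched as follows. For brevity the sketch compares each target box against a flat list of ghost boxes rather than traversing a BVH, which is a simplification; the `AABB` layout and the names `overlaps`, `markByBoxes`, and `markViolatedByBoxes` are hypothetical. It shows the two properties described above: only boxes are compared, and a thread stops at the first overlap.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Axis-aligned bounding box (assumed layout).
struct AABB { float lo[3]; float hi[3]; };

// Two boxes overlap unless they are separated along some axis.
__host__ __device__ inline bool overlaps(const AABB& a, const AABB& b)
{
    for (int k = 0; k < 3; ++k)
        if (a.hi[k] < b.lo[k] || b.hi[k] < a.lo[k]) return false;
    return true;
}

// One thread per target triangle; brute force over ghost boxes instead of a BVH.
__global__ void markByBoxes(const AABB* tri, int nt, const AABB* ghost, int ng,
                            unsigned char* violated)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nt || violated[i]) return;
    for (int g = 0; g < ng; ++g) {
        if (overlaps(tri[i], ghost[g])) {
            violated[i] = 1;  // inexact: the boxes overlap, the triangles may not
            return;           // stop at the first colliding ghost triangle
        }
    }
}

// Host wrapper: all triangles start as non-violated.
std::vector<unsigned char> markViolatedByBoxes(const std::vector<AABB>& tri,
                                               const std::vector<AABB>& ghost, int N)
{
    assert(N > 0 && !tri.empty() && !ghost.empty());
    int nt = static_cast<int>(tri.size());
    int ng = static_cast<int>(ghost.size());

    AABB *dTri = nullptr, *dGhost = nullptr;
    unsigned char* dViolated = nullptr;
    cudaMalloc(&dTri, nt * sizeof(AABB));
    cudaMalloc(&dGhost, ng * sizeof(AABB));
    cudaMalloc(&dViolated, nt);
    cudaMemcpy(dTri, tri.data(), nt * sizeof(AABB), cudaMemcpyHostToDevice);
    cudaMemcpy(dGhost, ghost.data(), ng * sizeof(AABB), cudaMemcpyHostToDevice);
    cudaMemset(dViolated, 0, nt);

    markByBoxes<<<nt / N + 1, N>>>(dTri, nt, dGhost, ng, dViolated);

    std::vector<unsigned char> violated(nt);
    cudaMemcpy(violated.data(), dViolated, nt, cudaMemcpyDeviceToHost);
    cudaFree(dTri); cudaFree(dGhost); cudaFree(dViolated);
    return violated;
}
```

Because the kernel touches only box data and runs the same few comparisons in every thread, it avoids mixing box tests and triangle tests in one kernel.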

On CPUs, the cache is larger and each processor is more powerful, so we perform boundary handling with exact collision detection by triangles.

We employ shared memory to store the bounding volume of each target triangle. During the traversal for a triangle there are many overlap tests, and the bounding volume of the target triangle is used frequently, so it is kept in shared memory, as shown in Algorithm 4. At the beginning, each thread in a block copies its own data from global memory to shared memory.

In the traversal stage, we use an explicit stack instead of recursion, as shown in Algorithm 4. Fermi GPUs support recursion, but recursion is slower. For a given BVH node, there are three cases.

• Case 1: The bounding volume of the target triangle does not overlap the node. The process is ended.
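A stack-based traversal with the target bounding volume held in shared memory can be sketched as below. This is not Algorithm 4: the node layout (`left < 0` marks a leaf), the fixed stack size, and all names are assumptions. Only Case 1 is taken from the text; the handling of the other two cases follows the conventional pattern (a leaf reports a collision, an internal node pushes both children) and is likewise an assumption.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

struct AABB { float lo[3]; float hi[3]; };
struct BVHNode { AABB box; int left; int right; };  // left < 0 marks a leaf (assumed)

constexpr int THREADS = 64;     // threads per block; sizes the shared array
constexpr int STACK_SIZE = 32;  // enough for shallow trees in this sketch

__host__ __device__ inline bool overlaps(const AABB& a, const AABB& b)
{
    for (int k = 0; k < 3; ++k)
        if (a.hi[k] < b.lo[k] || b.hi[k] < a.lo[k]) return false;
    return true;
}

__global__ void traverse(const AABB* triBox, int nt, const BVHNode* nodes,
                         int root, unsigned char* hit)
{
    __shared__ AABB target[THREADS];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nt) return;
    target[threadIdx.x] = triBox[i];  // each thread copies its own box once
    const AABB& box = target[threadIdx.x];

    int stack[STACK_SIZE];
    int top = 0;
    stack[top++] = root;
    unsigned char found = 0;
    while (top > 0 && !found) {
        BVHNode node = nodes[stack[--top]];
        if (!overlaps(box, node.box)) continue;  // Case 1: prune this node
        if (node.left < 0) {
            found = 1;                           // assumed: overlapping leaf is a hit
        } else if (top + 2 <= STACK_SIZE) {      // assumed: internal node, visit children
            stack[top++] = node.left;            // children are dropped if the stack is
            stack[top++] = node.right;           // full; a real kernel must size it safely
        }
    }
    hit[i] = found;
}

// Host wrapper: one thread per target box against a single BVH.
std::vector<unsigned char> traverseAll(const std::vector<AABB>& triBox,
                                       const std::vector<BVHNode>& nodes, int root)
{
    assert(!triBox.empty() && !nodes.empty());
    int nt = static_cast<int>(triBox.size());
    int nn = static_cast<int>(nodes.size());
    AABB* dBox = nullptr; BVHNode* dNodes = nullptr; unsigned char* dHit = nullptr;
    cudaMalloc(&dBox, nt * sizeof(AABB));
    cudaMalloc(&dNodes, nn * sizeof(BVHNode));
    cudaMalloc(&dHit, nt);
    cudaMemcpy(dBox, triBox.data(), nt * sizeof(AABB), cudaMemcpyHostToDevice);
    cudaMemcpy(dNodes, nodes.data(), nn * sizeof(BVHNode), cudaMemcpyHostToDevice);

    traverse<<<nt / THREADS + 1, THREADS>>>(dBox, nt, dNodes, root, dHit);

    std::vector<unsigned char> hit(nt);
    cudaMemcpy(hit.data(), dHit, nt, cudaMemcpyDeviceToHost);
    cudaFree(dBox); cudaFree(dNodes); cudaFree(dHit);
    return hit;
}
```

No `__syncthreads()` is needed here because each thread reads only the shared slot it wrote itself.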

From the document "View-Based Continuous Self-Collision Detection and Its Acceleration on Graphics Processors" (pages 69-73)