Algorithm of traversal for boundary handling with a stack

Implementation on GPUs

Algorithm 5 Algorithm of traversal for boundary handling with a stack

int stack[SIZE]
BVT = bounding volume of the target triangle
current node = the root of the BVH of a certain hole
while true do
    if current node and BVT overlap then
        if current node is a leaf then
            The target triangle is violated.
        else
            current node = left child of current node
            Push(stack, current node index + 1)
            continue
        end if
    end if
    if stack is empty then
        break
    end if
    current node = Pop(stack)
end while
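The stack-based traversal of Algorithm 5 can be sketched in sequential C++ as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: the bounding volume is assumed to be an axis-aligned box, the names `AABB`, `BVHNode`, `overlaps`, and `isViolated` are hypothetical, the root is assumed to be node 0, and the two children of an internal node are assumed to be stored at adjacent indices, which is what pushing "current node index + 1" suggests. Unlike the listing, the sketch returns as soon as an overlapping leaf is found.

```cpp
#include <cassert>
#include <vector>

// Axis-aligned bounding box. The bounding-volume type is an assumption
// made for this sketch.
struct AABB {
    float lo[3];
    float hi[3];
};

// Hypothetical node layout: the children of an internal node sit at
// indices left and left + 1; left < 0 marks a leaf.
struct BVHNode {
    AABB box;
    int left;
};

inline bool overlaps(const AABB& a, const AABB& b) {
    for (int k = 0; k < 3; ++k) {
        if (a.hi[k] < b.lo[k] || b.hi[k] < a.lo[k]) return false;
    }
    return true;
}

// Returns true if the bounding volume of the target triangle (bvt)
// overlaps a leaf of the hole's BVH, i.e. the triangle is violated.
bool isViolated(const std::vector<BVHNode>& bvh, const AABB& bvt) {
    assert(!bvh.empty());
    constexpr int SIZE = 64;  // assumed stack capacity
    int stack[SIZE];
    int top = 0;
    int current = 0;  // assumed root index
    while (true) {
        if (overlaps(bvh[current].box, bvt)) {
            if (bvh[current].left < 0) return true;  // overlapping leaf
            current = bvh[current].left;             // descend to the left child
            assert(top < SIZE);
            stack[top++] = current + 1;              // remember the right sibling
            continue;
        }
        if (top == 0) break;     // nothing left to visit
        current = stack[--top];  // resume at a deferred right child
    }
    return false;
}
```

Pushing only the right sibling keeps one stack entry per visited level, so the required stack capacity grows with the depth of the BVH rather than with its node count.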

considered nodes are both leaf nodes. However, on GPUs, if traversal starts from the roots of the BVHs, the degree of parallelism is low for the first few steps. For example, the first step of traversal is computed by only one thread, and the second step is computed by four threads if the roots overlap. We therefore do not traverse in the same way as on CPUs. Instead, traversal is performed per triangle in order to increase the degree of parallelism; in other words, traversal on GPUs is performed in a triangle-based manner. The potentially colliding pairs of each triangle can be stored individually in an array indexed by triangle id. We consider four policies for performing traversal on GPUs.

int THREAD = N;                         // threads per block
int BLOCK = numTriangles / THREAD + 1;  // numTriangles: number of triangles (nt)
gpu_traversal<<<BLOCK, THREAD, 0>>>(...);

The simplest method is to create one thread per triangle. In total, nt threads are created, with N threads per block, where nt is the number of triangles. Every thread is launched, but threads responsible for positively oriented triangles do not execute any computation; in other words, traversal is needed only for negatively oriented and violated triangles. We use a stack to perform traversal. In addition, we store the bounding volume of each triangle in shared memory to reduce global memory accesses. However, each thread must traverse from the root to the leaves for its triangle, which involves many overlap tests. Because a single GPU thread has lower computational capability than a CPU core, the performance is not good.
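To make the cost of idle threads in this policy concrete, the launch configuration can be emulated sequentially. The following is a hedged C++ sketch, not the kernel itself: `LaunchStats` and `emulateOneThreadPerTriangle` are hypothetical names, and the type encoding (0 for positively oriented, nonzero for negatively oriented or violated) is the one used for the triangle type list in this chapter.

```cpp
#include <cassert>
#include <vector>

struct LaunchStats {
    int blocks;   // number of blocks launched
    int threads;  // total threads launched (blocks * threadsPerBlock)
    int active;   // threads that would actually run a traversal
};

// Emulates "one thread per triangle": every triangle gets a thread, but
// only triangles whose type is nonzero do any work.
LaunchStats emulateOneThreadPerTriangle(const std::vector<int>& triangleType,
                                        int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    const int n = static_cast<int>(triangleType.size());
    LaunchStats s{n / threadsPerBlock + 1, 0, 0};  // same formula as the launch above
    s.threads = s.blocks * threadsPerBlock;
    for (int block = 0; block < s.blocks; ++block) {
        for (int t = 0; t < threadsPerBlock; ++t) {
            const int id = block * threadsPerBlock + t;  // global thread index
            if (id >= n) continue;                // padding thread past the last triangle
            if (triangleType[id] == 0) continue;  // positively oriented: no computation
            ++s.active;                           // the traversal would run here
        }
    }
    return s;
}
```

For 10 triangles of which 4 are negatively oriented or violated, and 4 threads per block, the sketch reports 3 blocks and 12 threads but only 4 active threads. The bounds check on `id` matters because the `+ 1` in the block count always launches some padding threads.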

int THREAD = N;                                  // threads per block
int BLOCK = numNegativeOrViolated / THREAD + 1;  // numNegativeOrViolated: nnv
gpu_traversal<<<BLOCK, THREAD, 0>>>(...);

In the second method, we preprocess the triangle type list before performing traversal. After the view tests and boundary handling, we have an array that indicates the type of every triangle. For example, Figure 5.4 shows such a result, where positively oriented triangles are indicated by 0, negatively oriented triangles by 1, and violated triangles by 2. We compress the triangle type list and collect the indices of the negatively oriented and violated triangles into a compact array. Compressing the triangle type list reduces the number of threads needed and balances the workload across threads, at the cost of some overhead for the compression itself. After compressing, we create nnv threads to perform traversal in parallel, where nnv is the number of negatively oriented and violated triangles. Each of these triangles is still traversed against the entire BVH.

Figure 5.4: The triangle type list.

The steps of compression are similar to those of boundary handling. First, we divide the triangle type list into nc chunks of 8 elements each and create nc threads. Then, each thread counts the negatively oriented and violated triangles in its chunk and stores the result in an array A. Next, we compute the exclusive prefix sum of A and store it in another array B; each element of B gives the start index in the compact array for the corresponding chunk. Finally, we write the indices of the negatively oriented and violated triangles into the compact array at the positions given by B, as shown in Figure 5.5. The elements of the compact array are thus the indices of the triangles that are negatively oriented or violated.
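The three compression steps (per-chunk count, exclusive prefix sum, scatter) can be sketched sequentially as follows. This is an illustrative C++ sketch rather than the GPU kernels: each iteration of the chunk loops stands in for one of the nc threads, and the function name `compactTriangleIndices` is hypothetical. The arrays `A` and `B` and the chunk size of 8 follow the description above.

```cpp
#include <cassert>
#include <vector>

// type[i] is 0 (positively oriented), 1 (negatively oriented) or 2 (violated).
// Returns the indices of all triangles whose type is nonzero, in order.
std::vector<int> compactTriangleIndices(const std::vector<int>& type) {
    constexpr int CHUNK = 8;  // elements per chunk, as in the text
    const int n = static_cast<int>(type.size());
    const int nc = (n + CHUNK - 1) / CHUNK;  // number of chunks (one thread each on the GPU)

    // Step 1: A[c] = number of negatively oriented or violated triangles in chunk c.
    std::vector<int> A(nc, 0);
    for (int c = 0; c < nc; ++c) {
        for (int i = c * CHUNK; i < n && i < (c + 1) * CHUNK; ++i) {
            if (type[i] != 0) ++A[c];
        }
    }

    // Step 2: B = exclusive prefix sum of A, the start index of each chunk
    // in the compact array.
    std::vector<int> B(nc, 0);
    int total = 0;
    for (int c = 0; c < nc; ++c) {
        B[c] = total;
        total += A[c];
    }

    // Step 3: each chunk writes its triangle indices starting at B[c].
    std::vector<int> compact(total);
    for (int c = 0; c < nc; ++c) {
        int out = B[c];
        for (int i = c * CHUNK; i < n && i < (c + 1) * CHUNK; ++i) {
            if (type[i] != 0) compact[out++] = i;
        }
        assert(out == B[c] + A[c]);  // the chunk filled exactly its own slots
    }
    return compact;
}
```

Because every chunk writes only to the range starting at its own entry of `B`, the scatter step needs no synchronization between chunks, which is what makes this layout suitable for one thread per chunk.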

Figure 5.5: Preprocess the triangle type list before performing traversal.

The following two policies are based on the front-based decomposition algorithm presented by Tang et al. [TMT09]. A bounding volume test tree (BVTT) contains all the overlap test pairs visited during traversal. Some nodes in the BVTT are marked as the front. Front nodes are the points at which traversal terminated in the previous frame. In the next frame, traversal starts from these front nodes instead of from the roots. If the quality measure of the BVTT front exceeds a threshold, the BVTT front has to be rebuilt.

int THREAD = N;                                  // threads per block
int BLOCK = numNegativeOrViolated / THREAD + 1;  // numNegativeOrViolated: nnv
gpu_traversal<<<BLOCK, THREAD, 0>>>(...);

In the third policy, nnv threads are created, where nnv is the number of negatively oriented and violated triangles. Each thread is responsible for traversing the BVH of the original object against its target triangle. We do not build a BVTT because our traversal is performed per triangle. Instead, in each frame we record a list of nodes for each triangle. The nodes are of two types: those that collide with the target triangle, which are leaf nodes, and those that do not collide with the target triangle, which may be internal nodes or leaf nodes. These nodes are called history nodes. The list consists of 512 × n elements, where n is the number of triangles; that is, we allocate 512 slots per triangle to store its history nodes. Initially, the list contains the root of the BVH, which means that traversal starts from the root for all triangles. After this initial stage, each triangle has its own list of history nodes, and traversal in the next frame starts from them. The algorithm is shown in Algorithm 6.
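One way to read the history-node idea is sketched below for a single triangle. This is an interpretation under stated assumptions and is not Algorithm 6: the types `AABB`, `BVHNode`, `FrameResult` and the function `traverseFromHistory` are hypothetical, the bounding volume is assumed to be an axis-aligned box, children are assumed to be stored at adjacent indices, and the sketch only refines the recorded nodes downward without ever merging them back toward the root.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct AABB {
    float lo[3];
    float hi[3];
};

// Children of an internal node sit at left and left + 1; left < 0 marks a leaf.
struct BVHNode {
    AABB box;
    int left;
};

inline bool overlaps(const AABB& a, const AABB& b) {
    for (int k = 0; k < 3; ++k) {
        if (a.hi[k] < b.lo[k] || b.hi[k] < a.lo[k]) return false;
    }
    return true;
}

constexpr int HISTORY_CAPACITY = 512;  // slots per triangle, as stated in the text

struct FrameResult {
    std::vector<int> history;     // nodes where traversal stopped this frame
    std::vector<int> collisions;  // overlapping leaves (a subset of history)
};

// Traverses one triangle starting from last frame's history nodes.
// For the first frame, pass a list that contains only the root index.
FrameResult traverseFromHistory(const std::vector<BVHNode>& bvh,
                                const std::vector<int>& previous,
                                const AABB& bvt) {
    FrameResult out;
    // Reverse copy so that nodes are popped in their recorded order.
    std::vector<int> stack(previous.rbegin(), previous.rend());
    while (!stack.empty()) {
        const int node = stack.back();
        stack.pop_back();
        const bool hit = overlaps(bvh[node].box, bvt);
        if (hit && bvh[node].left >= 0) {
            stack.push_back(bvh[node].left + 1);  // right child, visited second
            stack.push_back(bvh[node].left);      // left child, visited first
            continue;
        }
        out.history.push_back(node);           // non-overlapping node or overlapping leaf
        if (hit) out.collisions.push_back(node);
    }
    assert(out.history.size() <= static_cast<std::size_t>(HISTORY_CAPACITY));
    return out;
}
```

Feeding `history` from one frame into the next call as `previous` skips the upper levels of the BVH whenever the triangle has moved only slightly, which is the saving the front-based approach aims for. In this sketch the history can only grow, so a real implementation would need some rebuild rule, for example restarting from the root when the 512 slots are exhausted; that rule is an assumption here.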

In the document: View-Based Continuous Self-Collision Detection and Its Acceleration on GPUs (pages 74-78)