Algorithm of traversing with history nodes

Implementation on GPUs

Algorithm 6 Algorithm of traversing with history nodes

int stack[SIZE]
int round = the number of history nodes of the target triangle
BVT = bounding volume of the target triangle
for i = 0 to round-1 do
    current node = the i-th node in the history list
    if current node and BVT do not overlap then
        Push(HistoryList, current node index)
        continue
    end if
    if current node is a leaf then
        Push(PCPList, the index of the triangle within current node)
        Push(HistoryList, current node index)
        continue
    end if
    current node = left child of current node
    Push(stack, current node index + 1)
    while true do
        if current node and BVT overlap then
            if current node is a leaf then
                Push(PCPList, the index of the triangle within current node)
                Push(HistoryList, current node index)
            else
                current node = left child of current node
                Push(stack, current node index + 1)
                continue
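The following is a minimal C++ sketch of the traversal in Algorithm 6, written for the CPU so that it can be read and run in isolation. The `AABB` and `Node` types, the `overlap` helper, and the function name `traverseFromHistory` are hypothetical. Recording a non-overlapping interior node in the new history list and popping the next node from the stack are assumptions based on the description of Figure 5.6(b), since the listing above does not spell them out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical axis-aligned bounding box; the text does not name its bounding-volume type.
struct AABB {
    float lo[3];
    float hi[3];
};

// Two boxes overlap when their intervals intersect on all three axes.
inline bool overlap(const AABB& a, const AABB& b) {
    for (int k = 0; k < 3; ++k) {
        if (a.hi[k] < b.lo[k] || b.hi[k] < a.lo[k]) return false;
    }
    return true;
}

// Hypothetical node layout: the right child is stored directly after the left
// child, which matches "current node index + 1" in Algorithm 6.
struct Node {
    AABB box;
    int left;      // index of the left child, or -1 for a leaf
    int triangle;  // triangle index, meaningful for leaves only
};

// Traverses from every node in oldHistory, collecting the triangles of
// overlapping leaves in pcps and the termination nodes in newHistory.
void traverseFromHistory(const std::vector<Node>& bvh, const AABB& bvt,
                         const std::vector<int>& oldHistory,
                         std::vector<int>& pcps, std::vector<int>& newHistory) {
    std::vector<int> stack;
    for (int start : oldHistory) {
        assert(start >= 0 && start < static_cast<int>(bvh.size()));
        stack.assign(1, start);
        while (!stack.empty()) {
            int cur = stack.back();
            stack.pop_back();
            if (!overlap(bvh[cur].box, bvt)) {
                newHistory.push_back(cur);           // traversal terminates here
            } else if (bvh[cur].left < 0) {
                pcps.push_back(bvh[cur].triangle);   // overlapping leaf
                newHistory.push_back(cur);
            } else {
                stack.push_back(bvh[cur].left + 1);  // right child, visited later
                stack.push_back(bvh[cur].left);      // left child, visited next
            }
        }
    }
}
```

Unlike the listing, the sketch pushes both children onto the stack instead of descending into the left child directly; the visiting order is the same.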

Figure 5.6: Examples for history lists.

Figure 5.7: History list with many redundant nodes and reasonable nodes.

Figure 5.6(a) shows the initial BVH. We record the root node in the history list of the target triangle. In Figure 5.6(b), the traversal of the target triangle against the initial BVH terminates at the nodes marked in orange and blue. The bounding volume of the target triangle collides with the blue nodes, E and F, and does not collide with the orange nodes, b and g. We record nodes b, g, E, and F in the history list of the target triangle. In the next frame, the target triangle starts its traversal from nodes b, g, E, and F in sequence instead of from the root, a. This performs better than the previous methods for two reasons. First, for a target triangle, the average number of overlap tests when traversing from its history nodes is lower than when traversing from the root. Second, the collision results of a target triangle are likely to be similar to those of the previous frame. Traversing from history nodes therefore improves performance. As time goes by, however, the history nodes become less appropriate starting points, and the number of history nodes per triangle grows. As a result, many overlap tests are performed, including unnecessary ones.

For example, the history list may contain two sibling nodes, neither of which overlaps the target triangle. In that case we should test their parent instead of the two siblings, as shown in Figure 5.7. We address this problem as follows.

When the number of history nodes exceeds a threshold, we restart the traversal from the root node instead of from the history nodes in the next frame.

int THREAD = N;

int BLOCK = #Triangles with negatively oriented and violated/THREAD + 1;

gpu_traversal <<<BLOCK, THREAD, 0>>> (...);
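As a side note on the launch configuration above, the expression `n/THREAD + 1` always covers all `n` work items, but it launches one extra, empty block whenever `n` is a multiple of `THREAD`. The C++ sketch below compares it with the usual ceiling division; the helper names `blocksFor` and `blocksAsInText` are hypothetical.

```cpp
#include <cassert>

// Smallest number of blocks of `threads` threads that covers `n` work items.
inline int blocksFor(int n, int threads) {
    assert(n >= 0 && threads > 0);
    return (n + threads - 1) / threads;
}

// The formula used in the text, kept for comparison.
inline int blocksAsInText(int n, int threads) {
    assert(n >= 0 && threads > 0);
    return n / threads + 1;
}
```

With either formula the kernel still needs a bounds check on its global thread index, because the last block is usually only partly filled. `blocksFor` returns 0 when `n` is 0, in which case the caller would skip the launch.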

For the fourth policy, we preprocess the traversal for each triangle and record the collision state between the initial BVH and each triangle. In the initial configuration, there are no self-collisions among the triangles except between adjacent triangles. We record these initial history nodes and start traversal from them instead of from the root node. We observe that fewer overlap tests are performed when starting from the initial history nodes than when starting from the root node. In addition, there is one more condition for discarding the history nodes: every few frames, traversal is performed from the initial history nodes.

The fourth policy gives the best performance.

There are therefore two conditions under which the history nodes should be released.

1. When the number of history nodes exceeds a threshold.

2. Every few frames.
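The two release conditions can be sketched as a small predicate. This is only an illustration: the struct `HistoryPolicy`, its field names, and the function `shouldReleaseHistory` are hypothetical, and the text gives no concrete values for the threshold or the frame period.

```cpp
#include <cassert>

// Hypothetical tuning parameters; neither value is specified in the text.
struct HistoryPolicy {
    int maxHistoryNodes;  // release when a triangle has more history nodes than this
    int resetPeriod;      // release every resetPeriod frames
};

// Returns true when the history list of a triangle should be discarded, so that
// the next traversal starts from the root (or from the initial history nodes).
inline bool shouldReleaseHistory(const HistoryPolicy& p, int numHistoryNodes, int frame) {
    assert(p.maxHistoryNodes > 0 && p.resetPeriod > 0);
    bool tooMany = numHistoryNodes > p.maxHistoryNodes;
    bool periodic = frame > 0 && frame % p.resetPeriod == 0;
    return tooMany || periodic;
}
```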

5.6 Elementary Tests

After traversal, we obtain a set of potentially colliding pairs (PCPs) for each triangle. The PCPs are stored per triangle in an array of 16×nt elements, where nt is the number of triangles, so each triangle has 16 slots for its PCPs. We perform the elementary tests on the PCPs with two different policies.
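The fixed-slot layout can be sketched in C++ as follows. The type `PCPStore` and its members are hypothetical, each PCP is represented only by the index of the other triangle, and rejecting a seventeenth PCP is an assumption, since the text does not say how overflow is handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kSlotsPerTriangle = 16;  // slot count taken from the text

// Hypothetical container for the per-triangle PCP array.
struct PCPStore {
    std::vector<int> pcps;     // 16 * nt entries, -1 marks an unused slot
    std::vector<int> numPCPs;  // number of used slots per triangle

    explicit PCPStore(int nt)
        : pcps(static_cast<std::size_t>(nt) * kSlotsPerTriangle, -1),
          numPCPs(static_cast<std::size_t>(nt), 0) {}

    // Stores `other` in the next free slot of triangle `tri`.
    // Returns false when all 16 slots are already in use (assumed behavior).
    bool add(int tri, int other) {
        assert(tri >= 0 && tri < static_cast<int>(numPCPs.size()));
        int& count = numPCPs[tri];
        if (count == kSlotsPerTriangle) return false;
        pcps[tri * kSlotsPerTriangle + count] = other;
        ++count;
        return true;
    }
};
```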

int THREAD = N;

int BLOCK = #Triangles/THREAD + 1;

gpu_elementaryTest <<<BLOCK, THREAD, 0>>> (...);

Because the PCPs are stored per triangle, we create nt threads to perform the elementary tests, where nt is the number of triangles. Each thread processes the set of PCPs that belongs to one triangle. The problem is that a triangle may have anywhere from 0 to 16 PCPs, so the workload differs from thread to thread. We therefore sort the array that holds the number of PCPs of each triangle and assign work to the threads according to the number of PCPs, which balances the workload better. Performance improves, but the sorting itself incurs some overhead.
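A minimal C++ sketch of this reordering follows. It assumes the sort is in descending order of PCP count, so that neighboring threads receive similar workloads; the text does not state the sort direction, and the function name `orderByWorkload` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Returns the triangle indices ordered by decreasing number of PCPs.
// Thread i would then process triangle order[i].
std::vector<int> orderByWorkload(const std::vector<int>& numPCPs) {
    for (int c : numPCPs) assert(c >= 0 && c <= 16);  // at most 16 slots per triangle

    std::vector<int> order(numPCPs.size());
    std::iota(order.begin(), order.end(), 0);  // 0, 1, ..., nt-1

    // The stable sort keeps triangles with equal counts in their original order.
    std::stable_sort(order.begin(), order.end(),
                     [&numPCPs](int a, int b) { return numPCPs[a] > numPCPs[b]; });
    return order;
}
```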

int THREAD = N;

int BLOCK = #PCPs/THREAD + 1;

gpu_elementaryTest <<<BLOCK, THREAD, 0>>>(...);

Alternatively, we create np threads to perform the elementary tests, where np is the number of PCPs, so that each thread handles exactly one PCP. This requires preprocessing the PCP array before the elementary tests. Initially, the PCPs are laid out per triangle, and the number of PCPs of each triangle is stored in an array called numPCPs. We compute the exclusive prefix sum of the numPCPs array. We then use nt threads, where nt is the number of triangles, to collect the PCPs of each triangle in parallel. The result is a compact array of PCPs without unused space, as shown in Figure 5.8.

Figure 5.8: PCPs array preprocessing.
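The preprocessing step can be sketched sequentially in C++ as follows. The function names are hypothetical, and each PCP is again represented by a single `int`. On the GPU, the outer loop of `compactPCPs` corresponds to one thread per triangle, and the prefix sum would be computed by a parallel scan.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kSlotsPerTriangle = 16;  // slot count taken from the text

// offsets[i] = counts[0] + ... + counts[i-1], with offsets[0] = 0.
std::vector<int> exclusivePrefixSum(const std::vector<int>& counts) {
    std::vector<int> offsets(counts.size());
    int running = 0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        offsets[i] = running;
        running += counts[i];
    }
    return offsets;
}

// Copies the used slots of every triangle into one gap-free array.
std::vector<int> compactPCPs(const std::vector<int>& slotted,
                             const std::vector<int>& numPCPs) {
    assert(slotted.size() == numPCPs.size() * kSlotsPerTriangle);
    std::vector<int> offsets = exclusivePrefixSum(numPCPs);
    int total = numPCPs.empty() ? 0 : offsets.back() + numPCPs.back();
    std::vector<int> compact(total);

    // Iterations over t are independent of each other: triangle t writes only
    // to the range [offsets[t], offsets[t] + numPCPs[t]).
    for (std::size_t t = 0; t < numPCPs.size(); ++t) {
        for (int k = 0; k < numPCPs[t]; ++k) {
            compact[offsets[t] + k] = slotted[t * kSlotsPerTriangle + k];
        }
    }
    return compact;
}
```

The exclusive prefix sum gives each triangle the position of its first PCP in the compact array, which is what lets the triangles write concurrently without synchronization.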

We use two methods for the elementary tests on GPUs: Interval Newton and Cubic Solver [TCYM09]. On CPUs, the Interval Newton procedure is recursive. Because Fermi GPUs support recursive functions, the procedure is the same on CPUs and on GPUs. For Cubic Solver, we need to solve the cubic equation exactly, and again the procedure is the same on CPUs and on GPUs. In addition, we employ feature assignments [WB06] to reduce the computation required by the elementary tests.

Chapter 6
