5.1 OpenMP Standard
Nowadays there are two major parallel programming models: OpenMP [6], based on the shared memory architecture, and MPI [23], the Message Passing Interface. OpenMP has the great advantage of keeping the sequential and parallel codes identical: directives tell the compiler whether or not to parallelize a region, which preserves the original sequential code. OpenMP is also portable; the code can be ported to various machines without modification. We therefore choose OpenMP for our parallel implementation.
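As a minimal illustration of this style (the loop itself is hypothetical, not part of our construction code), a single directive suffices to parallelize a loop; a compiler without OpenMP support simply ignores the pragma and runs the loop sequentially:

// Minimal sketch: the directive parallelizes the loop across threads.
// Without OpenMP the pragma is ignored and the code stays sequential.
void scale(float* data, int n, float s)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] *= s;
}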
5.2 Domain Partitioning in Serial
Our goal is to divide the scene into disjoint regions containing roughly the same number of primitives, where each region represents a subtree. Simply splitting at the primitive median yields two sub-domains with roughly equal numbers of primitives.
At first a single thread is in charge of the construction; the only difference from the regular build is that we use the event median rather than the incremental sweeping SAH to choose the split position. The construction proceeds until a certain level is reached, and each node of that level becomes the root of a subtree at the next stage. For example, if we need eight subtrees, the construction is performed until level three is done. The events of these nodes are then moved to the event arrays of the subtrees.
Fig. 5-1 Domain partitioning in serial
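The following sketch outlines this serial stage; the routine and helper names (buildToLevel, eventMedian, registerSubtreeRoot) are illustrative, not our actual code:

// Sketch of the serial stage: split at the event median until the
// requested level; nodes at that level become subtree roots whose
// events seed the per-subtree event arrays.
void buildToLevel(Node* node, int level, int targetLevel)
{
    if (level == targetLevel) {
        registerSubtreeRoot(node);  // move its events to the subtree's arrays
        return;
    }
    int   axis  = longestAxis(node->bounds);
    float split = eventMedian(node, axis);   // median, not sweeping SAH
    splitNode(node, axis, split);
    buildToLevel(node->left,  level + 1, targetLevel);
    buildToLevel(node->right, level + 1, targetLevel);
}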
5.3 Subtree Construction in Parallel
Each subtree is given its own node pool and event pool; we assign the subtrees to threads and run the subtree construction in parallel:
#pragma omp parallel for
for (int sub = 0; sub < numSubtree; sub++)
    buildSubTree(sub);                                        (5.1)
From the index sub it has been assigned, each thread knows which pools to work on.
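As a sketch (the pool names are hypothetical), buildSubTree simply selects the pools that belong to its index, so the building threads never touch shared state:

// Sketch: each subtree index selects its own pools (names hypothetical),
// so no synchronization is needed between the building threads.
void buildSubTree(int sub)
{
    NodePool&  nodes  = nodePools[sub];
    EventPool& events = eventPools[sub];
    build(subtreeRoots[sub], nodes, events);   // the regular construction
}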
The subtree construction itself is identical to the regular construction described in the previous chapter.
5.4 Merging of Subtrees
After the subtree construction is done, every subtree has its own node pool and leaf pool. Since we want to keep the ray traversal code unchanged, we must merge these sub-arrays into a single array.
Fig. 5-2 Merging of sub-arrays
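A sketch of the merge follows; it assumes, for illustration, that child links are stored as indices into the node array, so every index inside a subtree must be shifted by the offset at which its pool is copied. The names are illustrative:

// Sketch of merging the per-subtree node pools (names illustrative).
// Child links are assumed to be indices into the node array, so they
// are shifted by the offset `base` at which each pool is appended.
int base = numTopLevelNodes;                // nodes from the serial stage
for (int sub = 0; sub < numSubtree; ++sub) {
    const NodePool& pool = nodePools[sub];
    for (int i = 0; i < pool.size; ++i) {
        Node n = pool.nodes[i];
        if (!n.isLeaf) {                    // only interior nodes have children
            n.left  += base;
            n.right += base;
        }
        globalNodes[base + i] = n;
    }
    linkSubtreeRoot(sub, base);             // patch the top-level parent's link
    base += pool.size;
}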
5.5 Event Sorting on GPU
Since all events are sorted only once, at the root, and the ascending order is maintained throughout the construction, much time can be saved by doing this sorting in parallel. For a model with 60k triangles there are 120k events on each of the three axes, so we must sort 120k elements three times. We use a bitonic merge sort on the GPU for this sorting; the details are discussed in Chapter 6.
Although the GPU affords extreme performance for the root event sorting, the domain partitioning here is still performed in serial. Moreover, it is unrealistic to expect every desktop, let alone every laptop, to have a decent GPU. The implementation also differs across GPUs, and NVIDIA is not the sole vendor in the market.
5.6 Domain Partitioning in Parallel Using 1D Binning
To achieve higher scalability we must do the initial partitioning in parallel as well. Intuitively we would like to perform the event classification in parallel, but this does not work out: multiple threads working on the same event lists cause race conditions. We could of course distribute the events evenly among threads, but additional merge sorts would then be inevitable to maintain the ascending order.
Rather than partitioning the sorted events, we instead partition the triangles. We pigeonhole all the triangles into bins along one selected axis; this is called binning [27]. By accumulating the triangle counts bin by bin, one can decide at which bin to split the region. In the end we obtain regions with roughly equal numbers of triangles.
First we compute the minimum and maximum of the whole scene on the three axes and choose the axis with the longest extent for binning; this can be done in O(N). A fixed number of bins is set on the selected axis. All triangles are distributed evenly among threads, and all threads perform min-max binning in parallel. We then walk through the bins one at a time, accumulating triangle counts from the minimum toward the maximum; whenever the accumulated count exceeds the target fraction (25% for 4 sub-regions), we mark the left boundary of the current bin as a boundary of a sub-region.
Fig. 5-3 Parallel 1D binning for domain partitioning
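The sketch below illustrates the procedure; NUM_BINS, the helper names, and the use of triangle centroids (a simplification of the min-max binning described above) are assumptions for illustration:

// Sketch of parallel binning and boundary placement (simplified to
// centroid binning; NUM_BINS and the helper names are assumptions).
const int NUM_BINS = 1024;
int hist[NUM_BINS] = {0};
#pragma omp parallel
{
    int local[NUM_BINS] = {0};              // private histogram per thread
    #pragma omp for nowait
    for (int t = 0; t < numTriangles; ++t) {
        int b = binIndex(centroid(tri[t], axis), sceneMin, binWidth);
        ++local[b];
    }
    #pragma omp critical                    // reduce into the shared histogram
    for (int b = 0; b < NUM_BINS; ++b)
        hist[b] += local[b];
}
int acc = 0, region = 0;
int target = numTriangles / numSubtree;     // e.g. 25% each for 4 regions
for (int b = 0; b < NUM_BINS && region < numSubtree - 1; ++b) {
    acc += hist[b];
    if (acc >= (region + 1) * target)       // threshold crossed in this bin
        boundary[region++] = sceneMin + b * binWidth;   // its left edge
}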
After all boundaries of the sub-regions have been decided, we classify all triangles into the sub-regions, also in parallel. Now all subtrees have roughly the same number of triangles, and after sorting, each subtree has its own sorted event list. Not only is the triangle binning done in parallel, but the event sorting is as well; the scalability improves drastically.
5.7 Domain Partitioning in Parallel Using 3D Binning
Binning on one dimension is easy, but eventually reaches a “tipping point”: for a partition into 8, 16 or more regions, the space degenerates into “slices of bread”. The tree quality degrades severely, and the ray traversal performance plunges accordingly.
A remedy is to perform the binning on all three dimensions. However, another difficulty arises: we cannot bin all the triangles into their sub-regions in a single step, so the binning is factored into a DFS or BFS fashion; a recursive sketch is given after Fig. 5-4.
Fig. 5-4 Parallel 3D binning for domain partitioning
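A DFS-fashion sketch of the idea follows; all the names are illustrative, not our actual routines. Each sub-domain is split once along its own longest axis using the 1D binning described above, and the halves are partitioned again:

// Sketch of recursive (DFS-fashion) 3D binning: every sub-domain picks
// its own longest axis, so the cuts alternate between dimensions
// instead of slicing a single axis into thin slabs.
void partition3D(const Region& region, int numSub)
{
    if (numSub == 1) { emitSubRegion(region); return; }
    int   axis  = longestAxis(region.bounds);
    float split = medianByBinning(region, axis);    // 1D binning within region
    Region left, right;
    classify(region, axis, split, left, right);     // parallel over triangles
    partition3D(left,  numSub / 2);
    partition3D(right, numSub - numSub / 2);
}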
The 3D binning is slower but better preserves the tree quality. For a partition into fewer than 8 regions we still prefer 1D binning.