5.1 OpenMP Standard
Nowadays there are two major parallel programming models: OpenMP [6], based on the shared memory architecture, and MPI [23], the Message Passing Interface. OpenMP has the great advantage of keeping the sequential and parallel codes identical: directives tell the compiler whether or not to parallelize a region, which preserves the original sequential code. OpenMP is also portable; the code can be ported to various machines without modification. We therefore choose OpenMP for our parallel implementation.
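As a minimal illustration of this style (the loop itself is hypothetical, not part of our construction code), a single directive suffices to parallelize a loop; a compiler without OpenMP support simply ignores the pragma and runs the loop sequentially:

// Minimal sketch: the directive parallelizes the loop across threads.
// Without OpenMP the pragma is ignored and the code stays sequential.
void scale(float* data, int n, float s)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] *= s;
}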
5.2 Domain Partitioning in Serial
Our goal is to divide the scene into disjoint regions containing roughly the same number of primitives, where each region represents a subtree. Simply splitting at the primitive median yields two sub-domains with roughly equal numbers of primitives.
At first a single thread is in charge of the construction; the only difference from the regular build is that we use the event median rather than the incremental sweeping SAH to choose the split position. The construction proceeds until a certain level is reached, and each node of that level becomes the root of a subtree at the next stage. For example, if we need eight subtrees, the construction is performed until level three is done. The events of these nodes are then moved to the event arrays of the subtrees.
Fig. 5-1 Domain partitioning in serial
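The following sketch outlines this serial stage; the routine and helper names (buildToLevel, eventMedian, registerSubtreeRoot) are illustrative, not our actual code:

// Sketch of the serial stage: split at the event median until the
// requested level; nodes at that level become subtree roots whose
// events seed the per-subtree event arrays.
void buildToLevel(Node* node, int level, int targetLevel)
{
    if (level == targetLevel) {
        registerSubtreeRoot(node);  // move its events to the subtree's arrays
        return;
    }
    int   axis  = longestAxis(node->bounds);
    float split = eventMedian(node, axis);   // median, not sweeping SAH
    splitNode(node, axis, split);
    buildToLevel(node->left,  level + 1, targetLevel);
    buildToLevel(node->right, level + 1, targetLevel);
}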
5.3 Subtree Construction in Parallel
Each subtree is given its own node pool and event pool; we assign the subtrees to threads and run the subtree construction in parallel:
#pragma omp parallel for
for (int sub = 0; sub < numSubtree; sub++)
    buildSubTree(sub);                                        (5.1)
From the index sub it has been assigned, each thread knows which pools to work on.
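As a sketch (the pool names are hypothetical), buildSubTree simply selects the pools that belong to its index, so the building threads never touch shared state:

// Sketch: each subtree index selects its own pools (names hypothetical),
// so no synchronization is needed between the building threads.
void buildSubTree(int sub)
{
    NodePool&  nodes  = nodePools[sub];
    EventPool& events = eventPools[sub];
    build(subtreeRoots[sub], nodes, events);   // the regular construction
}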
The subtree construction itself is identical to the regular construction described in the previous chapter.
5.4 Merging of Subtrees
After the subtree construction is done, every subtree has its own node pool and leaf pool. Since we want to keep the ray traversal code unchanged, we must merge these sub-arrays into a single array.
Fig. 5-2 Merging of sub-arrays
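A sketch of the merge follows; it assumes, for illustration, that child links are stored as indices into the node array, so every index inside a subtree must be shifted by the offset at which its pool is copied. The names are illustrative:

// Sketch of merging the per-subtree node pools (names illustrative).
// Child links are assumed to be indices into the node array, so they
// are shifted by the offset `base` at which each pool is appended.
int base = numTopLevelNodes;                // nodes from the serial stage
for (int sub = 0; sub < numSubtree; ++sub) {
    const NodePool& pool = nodePools[sub];
    for (int i = 0; i < pool.size; ++i) {
        Node n = pool.nodes[i];
        if (!n.isLeaf) {                    // only interior nodes have children
            n.left  += base;
            n.right += base;
        }
        globalNodes[base + i] = n;
    }
    linkSubtreeRoot(sub, base);             // patch the top-level parent's link
    base += pool.size;
}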
5.5 Event Sorting on GPU
Since all events are sorted only once, at the root, and the ascending order is maintained throughout the construction, much time can be saved by doing this sorting in parallel. For a model with 60k triangles there are 120k events on each of the three axes, so we must sort 120k elements three times. We use a bitonic merge sort on the GPU for this sorting; the details are discussed in Chapter 6.
Although the GPU affords extreme performance for the root event sorting, the domain partitioning here is still performed in serial. Moreover, it is unrealistic to expect every desktop, let alone every laptop, to have a decent GPU. The implementation also differs across GPUs, and NVIDIA is not the sole vendor in the market.
5.6 Domain Partitioning in Parallel Using 1D Binning
To achieve higher scalability we must do the initial partitioning in parallel as well. Intuitively we would like to perform the event classification in parallel, but this does not work out: multiple threads working on the same event lists cause race conditions. We could of course distribute the events evenly among threads, but additional merge sorts would then be inevitable to maintain the ascending order.
Rather than partitioning the sorted events, we instead partition the triangles. We pigeonhole all the triangles into bins along one selected axis; this is called binning [27]. By accumulating the triangle counts bin by bin, one can decide at which bin to split the region. In the end we obtain regions with roughly equal numbers of triangles.
First we compute the minimum and maximum of the whole scene on the three axes and choose the axis with the longest extent for binning; this can be done in O(N). A fixed number of bins is set on the selected axis. All triangles are distributed evenly among threads, and all threads perform min-max binning in parallel. We then walk through the bins one at a time, accumulating triangle counts from the minimum toward the maximum; whenever the accumulated count exceeds the target fraction (25% for 4 sub-regions), we mark the left boundary of the current bin as a boundary of a sub-region.
Fig. 5-3 Parallel 1D binning for domain partitioning
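The sketch below illustrates the procedure; NUM_BINS, the helper names, and the use of triangle centroids (a simplification of the min-max binning described above) are assumptions for illustration:

// Sketch of parallel binning and boundary placement (simplified to
// centroid binning; NUM_BINS and the helper names are assumptions).
const int NUM_BINS = 1024;
int hist[NUM_BINS] = {0};
#pragma omp parallel
{
    int local[NUM_BINS] = {0};              // private histogram per thread
    #pragma omp for nowait
    for (int t = 0; t < numTriangles; ++t) {
        int b = binIndex(centroid(tri[t], axis), sceneMin, binWidth);
        ++local[b];
    }
    #pragma omp critical                    // reduce into the shared histogram
    for (int b = 0; b < NUM_BINS; ++b)
        hist[b] += local[b];
}
int acc = 0, region = 0;
int target = numTriangles / numSubtree;     // e.g. 25% each for 4 regions
for (int b = 0; b < NUM_BINS && region < numSubtree - 1; ++b) {
    acc += hist[b];
    if (acc >= (region + 1) * target)       // threshold crossed in this bin
        boundary[region++] = sceneMin + b * binWidth;   // its left edge
}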
After all boundaries of the sub-regions have been decided, we classify all triangles into the sub-regions, also in parallel. Now all subtrees have roughly the same number of triangles, and after sorting, each subtree has its own sorted event list. Not only is the triangle binning done in parallel, but the event sorting is as well; the scalability improves drastically.
5.7 Domain Partitioning in Parallel Using 3D Binning
Binning on one dimension is easy, but eventually reaches a “tipping point”: for a partition into 8, 16 or more regions, the space degenerates into “slices of bread”. The tree quality degrades severely, and the ray traversal performance plunges accordingly.
A remedy is to perform the binning on all three dimensions. However, another difficulty arises: we cannot bin all the triangles into their sub-regions in a single step, so the binning is factored into a DFS or BFS fashion; a recursive sketch is given after Fig. 5-4.
Fig. 5-4 Parallel 3D binning for domain partitioning
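A DFS-fashion sketch of the idea follows; all the names are illustrative, not our actual routines. Each sub-domain is split once along its own longest axis using the 1D binning described above, and the halves are partitioned again:

// Sketch of recursive (DFS-fashion) 3D binning: every sub-domain picks
// its own longest axis, so the cuts alternate between dimensions
// instead of slicing a single axis into thin slabs.
void partition3D(const Region& region, int numSub)
{
    if (numSub == 1) { emitSubRegion(region); return; }
    int   axis  = longestAxis(region.bounds);
    float split = medianByBinning(region, axis);    // 1D binning within region
    Region left, right;
    classify(region, axis, split, left, right);     // parallel over triangles
    partition3D(left,  numSub / 2);
    partition3D(right, numSub - numSub / 2);
}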
The 3D binning is slower but better preserves the tree quality. For a partition into fewer than 8 regions we still prefer 1D binning.