
Chapter 3 Parallelizing the MMOG Server

3.5.  Parallel Range Query Algorithm

After the conflict-free update vectors have been computed for the client commands, we need to update the virtual world by range search, which is the most time-consuming step overall. The range search problem has been studied extensively for more than twenty years, and many spatial partitioning methods have been proposed to accelerate neighbor search, such as the quad/oct-tree, the range tree, and the R*-tree. Most of them rely on pointers to build up the tree structure, which can hardly be done on current GPUs due to their limited instruction set and fixed memory model. In [18], Paul uses well-separated pair decomposition to make the parallel all-nearest-neighbors search optimal; since it requires recursion during tree construction and pair computation, it is not feasible on current GPUs even when CUDA is used. We therefore derive a parallel range query algorithm that can be executed efficiently on massively parallel processors within the CUDA constraints.

To simplify the problem, we first assume that the visibility range is fixed and identical for all players; this limitation can be relaxed by appending a parallel filtering pass at the end of the query. Once the range is fixed, we can disperse the players into a 2D or 3D grid according to their positions in Cartesian space. The grid can be seen as a list of indexed buckets, where each bucket is a square/cube with edge length equal to the fixed visibility range. This is essentially the well-known quad/oct-tree data structure.
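For concreteness, a minimal sketch of the bucket index computation is given below. The constants VISIBILITY_RANGE and ROW_SIZE (the number of buckets per grid row) are illustrative assumptions, not values from the implementation; POSITION_TO_BUCKET_IDX is the mapping used by the pseudocode later in this section.

    // A minimal sketch (not the exact implementation): map a 2D position to a
    // row-major bucket index. VISIBILITY_RANGE and ROW_SIZE are assumed values.
    #define VISIBILITY_RANGE 10.0f   // bucket edge length, e.g. AOI = 10x10
    #define ROW_SIZE 250             // buckets per row, e.g. 2500 / 10

    __device__ int POSITION_TO_BUCKET_IDX(float2 pos)
    {
        int bx = (int)(pos.x / VISIBILITY_RANGE);
        int by = (int)(pos.y / VISIBILITY_RANGE);
        return by * ROW_SIZE + bx;   // row-major bucket index
    }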

However, since pointers can hardly be implemented and the number of players in each bucket varies, we cannot store the grid directly on the GPU. A straightforward approach, for example, is to define a maximum number of players per bucket and reserve that much space for every bucket in the grid. Nevertheless, this wastes memory, and as the grid grows larger the GPU will eventually run out of memory, because GPU memory is small relative to current CPU memory¹. To avoid wasting memory while preserving the efficiency of range query on the GPU, we redesigned the data representation and search algorithm, as explained below.

We still rely on the bucket concept of the quad/oct-tree to group the players, but this time we store all player references in a contiguous array sorted by their bucket indices. This can be done efficiently on the GPU by a parallel load-balanced radix sort.
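As an illustrative sketch only (the thesis uses its own parallel load-balanced radix sort), the same data layout can be produced by a key-value sort, with bucket indices as keys and player references as values:

    // Illustrative stand-in for the parallel load-balanced radix sort: sort
    // player references (values) by bucket index (keys) on the GPU. Thrust is
    // used here purely for exposition; it postdates the CUDA 0.81 toolkit
    // used in this thesis.
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    void sort_players_by_bucket(thrust::device_vector<int>& bucket_idx,
                                thrust::device_vector<int>& player_ref)
    {
        // Afterwards player_ref holds all player references contiguously,
        // grouped by bucket, and bucket_idx is sorted in ascending order.
        thrust::sort_by_key(bucket_idx.begin(), bucket_idx.end(),
                            player_ref.begin());
    }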

Following this representation, we need to perform a range query for the neighbors of each player whose state is modified by the game logic. Inspired by the discrete method of Takahiro [19], we define each bucket as a square/cube with edge length equal to the visibility range, so for each update vector we only need to enumerate the players in the adjacent buckets, as in Fig 3-9. The result is the affected bucket list, composed of pairs of the form {update vector index, bucket index}.

Fig 3-9 Multiple update ranges in a grid

¹ In our setting with the latest NVIDIA 8800GTX, we have only 768MB of GPU memory.

Because all player references are stored contiguously, there is no direct index for finding the players that belong to a specific bucket. We therefore employ a parallel binary search to identify the range of each bucket, that is, we perform binary searches for multiple target keys at the same time. To reduce the number of searches, we first extract the distinct bucket indices from the affected bucket list, in a manner similar to the way update conflicts are resolved. The algorithms are illustrated in Fig 3-10, Fig 3-11, and Fig 3-12.
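As a concrete sketch of the multi-key search (simplified types of our own; the thesis pseudocode appears in Fig 3-13), one thread per distinct bucket index finds the boundaries of that bucket's run in the sorted array:

    // One thread per distinct bucket index: locate the contiguous run of that
    // bucket inside the ascending array bucket_sorted of length np.
    struct bucket_range { int left, right; };   // half-open range [left, right)

    __global__ void parallel_binary_search(const int* bucket_sorted, int np,
                                           const int* dblist, int nb,
                                           bucket_range* dbrlist)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nb) return;
        int key = dblist[i];

        int lo = 0, hi = np;                 // lower bound: first index >= key
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (bucket_sorted[mid] < key) lo = mid + 1; else hi = mid;
        }
        dbrlist[i].left = lo;

        hi = np;                             // upper bound: first index > key
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (bucket_sorted[mid] <= key) lo = mid + 1; else hi = mid;
        }
        dbrlist[i].right = lo;
    }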

Once we have all the needed bucket ranges, we can enumerate the state updates for all adjacent buckets in parallel. Note that before the enumeration we have to count the number of possible state updates, so that all updates are written to the correct memory locations in a contiguous way. Details of the algorithm can be found in Fig 3-13 to Fig 3-16.
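The counting step is the usual count/scan/write pattern: a prefix scan over the per-entry counts yields each entry's write offset, so all threads can write their updates contiguously without collisions. A minimal host-side sketch, with Thrust standing in for our ParallelPrefixScan:

    // Turn per-entry update counts into contiguous write offsets. With an
    // inclusive scan, entry i writes its updates to output positions
    // [nclist_scan[i] - nclist[i], nclist_scan[i]).
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    int compute_total_updates(const thrust::device_vector<int>& nclist,
                              thrust::device_vector<int>& nclist_scan)
    {
        thrust::inclusive_scan(nclist.begin(), nclist.end(),
                               nclist_scan.begin());
        return nclist_scan.back();   // total update count to allocate/download
    }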

Algorithm ParallelRangeQuery_WriteAffectedBuckets (plist, np, ulist, nu, blist) ;
Input: plist (an array of player state vectors of size np), ulist (an array of update vectors of size nu).
Output: blist (an array of bucket update vectors of size nu*9).

begin
    declare integer bs := 256 ;
    declare integer gs := ceil(nu/bs) ;
    for bid := 0 to gs-1 do in parallel
        for tid := 0 to bs-1 do in parallel
            declare integer global_tid := bid*bs + tid ;
            if global_tid < nu then
                declare update_vector upd := ulist[global_tid] ;
                declare float2 pos := plist[upd.playerID].position ;
                declare integer bucket_idx := POSITION_TO_BUCKET_IDX(pos) ;
                declare integer base := global_tid*9 ;
                declare bucket_update_vector bupd ;
                bupd.updateID := global_tid ;
                bupd.bucketID := bucket_idx ;            blist[base+0] := bupd ;
                bupd.bucketID := bucket_idx-1 ;          blist[base+1] := bupd ;
                bupd.bucketID := bucket_idx+1 ;          blist[base+2] := bupd ;
                bupd.bucketID := bucket_idx-ROW_SIZE ;   blist[base+3] := bupd ;
                bupd.bucketID := bucket_idx-ROW_SIZE-1 ; blist[base+4] := bupd ;
                bupd.bucketID := bucket_idx-ROW_SIZE+1 ; blist[base+5] := bupd ;
                bupd.bucketID := bucket_idx+ROW_SIZE ;   blist[base+6] := bupd ;
                bupd.bucketID := bucket_idx+ROW_SIZE-1 ; blist[base+7] := bupd ;
                bupd.bucketID := bucket_idx+ROW_SIZE+1 ; blist[base+8] := bupd ;
end

Fig 3-10 Write affected bucket indices for the parallel range query algorithm

Algorithm ParallelRangeQuery_MarkSeparation (blist, nb, seplist) ;
Input: blist (a sorted array of bucket update vectors of size nb).
Output: seplist (an array of separation marks of size nb).

begin
    declare shared bucket_update_vector shm[] ;
    declare integer bs := 256 ;
    declare integer gs := ceil(nb/bs) ;
    for bid := 0 to gs-1 do in parallel
        for tid := 0 to bs-1 do in parallel
            declare integer global_tid := bid*bs + tid ;
            declare integer mark := 0 ;
            if global_tid < nb then
                shm[tid] := blist[global_tid] ;
            synchronize_threads() ;
            if global_tid < nb then
                declare bucket_update_vector tl := shm[tid] ;
                declare bucket_update_vector tr ;
                if global_tid = 0 then
                    mark := 1 ;
                else
                    if tid > 0 then tr := shm[tid-1] ;
                    else tr := blist[global_tid-1] ;
                    if tl.bucketID != tr.bucketID then mark := 1 ;
            synchronize_threads() ;
            if global_tid < nb then
                seplist[global_tid] := mark ;
end

Fig 3-11 Mark separation for the parallel range query algorithm

Algorithm ParallelRangeQuery_StoreSeparation (blist, seplist, silist, ns, dblist) ;
Input: blist (a sorted array of bucket update vectors of size ns), seplist (an array of size ns containing either 0 or 1 as the mark of separation), silist (the inclusive prefix scan of seplist, of size ns, used as memory location references).
Output: dblist (an array to store the distinct bucket indices).

begin
    declare integer bs := 256 ;
    declare integer gs := ceil(ns/bs) ;
    for bid := 0 to gs-1 do in parallel
        for tid := 0 to bs-1 do in parallel
            declare integer global_tid := bid*bs + tid ;
            declare integer mark := 0 ;
            declare integer reference_address := 0 ;
            if global_tid < ns then
                mark := seplist[global_tid] ;
            synchronize_threads() ;
            if global_tid < ns && mark > 0 then
                reference_address := silist[global_tid] - 1 ;
            synchronize_threads() ;
            if global_tid < ns && mark > 0 then
                dblist[reference_address] := blist[global_tid].bucketID ;
end

Fig 3-12 Store separation indices for the parallel range query algorithm

Algorithm ParallelRangeQuery_BinarySearch (plist_sort, np, dblist, nb, dbrlist) ;
Input: plist_sort (a sorted array of player bucket vectors of size np), dblist (an array of distinct bucket indices of size nb).
Output: dbrlist (an array of bucket ranges of size nb to store the start/end indices of each bucket).

begin
    declare integer bs := 256 ;
    declare integer gs := ceil(nb/bs) ;
    for bid := 0 to gs-1 do in parallel
        for tid := 0 to bs-1 do in parallel
            declare integer global_tid := bid*bs + tid ;
            if global_tid < nb then
                declare integer bucket_idx := dblist[global_tid] ;
                declare bucket_range range ;
                declare integer ll := 0 ; declare integer lh := np ;
                declare integer rl := 0 ; declare integer rh := np ;
                while ll < lh or rl < rh do
                    if ll < lh then
                        declare integer lm := (ll + lh) / 2 ;
                        if plist_sort[lm].bucketID < bucket_idx then ll := lm + 1 ;
                        else lh := lm ;
                    if rl < rh then
                        declare integer rm := (rl + rh) / 2 ;
                        if plist_sort[rm].bucketID <= bucket_idx then rl := rm + 1 ;
                        else rh := rm ;
                range.left := ll ;
                range.right := rl ;
                dbrlist[global_tid] := range ;
end

Fig 3-13 The two-way binary search in the parallel range query algorithm

Algorithm ParallelRangeQuery_CountUpdates (
    blist, nb, ulist, nu, silist, nsi, dbrlist, ndbr, nclist) ;
Input: blist (a sorted array of bucket update vectors of size nb), ulist (an array of update vectors of size nu), silist (an array of distinct bucket references of size nsi, the inclusive prefix scan of the separation marks), dbrlist (an array of bucket ranges of size ndbr).
Output: nclist (an array of size nb holding the number of update pairs generated by each bucket update vector).

begin
    declare integer bs := 256 ;
    declare integer gs := ceil(nb/bs) ;
    for bid := 0 to gs-1 do in parallel
        for tid := 0 to bs-1 do in parallel
            declare integer global_tid := bid*bs + tid ;
            if global_tid < nb then
                declare integer si_idx := silist[global_tid] - 1 ;
                declare bucket_range range := dbrlist[si_idx] ;
                nclist[global_tid] := range.right - range.left ;
end

Fig 3-14 Count the number of updates for the parallel range query algorithm

Algorithm ParallelRangeQuery_EnumUpdates (blist, nb, ulist, nu, silist, nsi, dbrlist, ndbr, nclist_scan, nncs, plist_sort, np, nlist, max_nn) ;
Input: blist (a sorted array of bucket update vectors of size nb), ulist (an array of update vectors of size nu), silist (an array of distinct bucket references of size nsi), dbrlist (an array of bucket ranges of size ndbr), nclist_scan (the inclusive prefix scan of nclist, of size nncs), plist_sort (a sorted array of player bucket vectors of size np).
Output: nlist (an array of state update vectors of maximum size max_nn).

begin
    declare integer bs := 256 ;
    declare integer gs := ceil(nb/bs) ;
    for bid := 0 to gs-1 do in parallel
        for tid := 0 to bs-1 do in parallel
            declare integer global_tid := bid*bs + tid ;
            if global_tid < nb then
                declare bucket_update_vector bupd := blist[global_tid] ;
                declare update_vector upd := ulist[bupd.updateID] ;
                declare integer si_idx := silist[global_tid] - 1 ;
                declare bucket_range range := dbrlist[si_idx] ;
                declare integer count := range.right - range.left ;
                declare integer base := nclist_scan[global_tid] - count ;
                declare state_update supd ;
                supd.updateInfo := upd ;
                for i := range.left to range.right-1 do
                    supd.playerID := plist_sort[i].playerID ;
                    if base + (i - range.left) < max_nn then
                        nlist[base + (i - range.left)] := supd ;
end

Fig 3-15 Enumerate the updates for the parallel range query algorithm

Algorithm ParallelRangeQuery (plist, np, ulist, nu, nlist, max_nn) ;
Input: plist (an array of player state vectors of size np), ulist (an array of update vectors of size nu).
Output: nlist (an array of notification vectors of maximum size max_nn).

begin
    declare global player_bucket plist_sorted[np] ;
    declare global bucket_update_vector blist[nu*9] ;
    declare global bucket_update_vector blist_sorted[nu*9] ;
    declare global integer seplist[nu*9] ;
    declare global integer silist[nu*9] ;

    ParallelLoadBalancedRadixSort(plist, np, plist_sorted) ;
    ParallelRangeQuery_WriteAffectedBuckets(plist, np, ulist, nu, blist) ;
    ParallelLoadBalancedRadixSort(blist, nu*9, blist_sorted) ;
    ParallelRangeQuery_MarkSeparation(blist_sorted, nu*9, seplist) ;
    ParallelPrefixScan(seplist, nu*9, silist) ;

    declare global integer nb := silist[nu*9-1] ;
    declare global integer dblist[nb] ;
    declare global bucket_range dbrlist[nb] ;

    ParallelRangeQuery_StoreSeparation(blist_sorted, seplist, silist, nu*9, dblist) ;
    ParallelRangeQuery_BinarySearch(plist_sorted, np, dblist, nb, dbrlist) ;

    declare global integer nclist[nu*9] ;
    declare global integer nclist_scan[nu*9] ;

    ParallelRangeQuery_CountUpdates(blist_sorted, nu*9, ulist, nu, silist, nu*9, dbrlist, nb, nclist) ;
    ParallelPrefixScan(nclist, nu*9, nclist_scan) ;
    ParallelRangeQuery_EnumUpdates(blist_sorted, nu*9, ulist, nu, silist, nu*9, dbrlist, nb, nclist_scan, nu*9, plist_sorted, np, nlist, max_nn) ;
end

Fig 3-16 The parallel range query algorithm

Chapter 4 Experimental Results and Analysis

4.1. Experimental Setup

We evaluate the performance of our GPU MMOG algorithm and compare it with a naïve CPU approach to client command processing and world updating in a simulated virtual world. Several scenarios with different map sizes and different AOIs (Area-Of-Interest) are simulated on both CPU and GPU. To demonstrate the performance boost and the capacity of our GPU algorithms, for each scenario we vary the number of clients from 512 to 524288 (approx. 0.5M). Suppose each client sends one command to the server every second. Without loss of generality, we assume the inter-arrival times of client commands are uniform; therefore, over a time span of one second, we expect to receive as many client commands as there are clients.

For each experiment, we measure the time the CPU or GPU takes to process all client commands arriving within one second, to see whether it can handle the given number of clients. Each setting is run 100 times to obtain the average time to process all client commands and its standard deviation. Clearly, if the average processing time exceeds one second, the setting will eventually crash the server, since the backlog of unprocessed client commands grows without bound.

4.1.1. Hardware Configuration

To support CUDA computing, we have the following hardware configuration:

CPU          Intel Core 2 Duo E6300 (1.83GHz, dual-core)
Motherboard  ASUS Striker Extreme (NVIDIA 680i SLI chipset)
RAM          Transcend 2GB DDR2-800
GPU          NVIDIA 8800GTX 768MB (MSI OEM)
HDD          WD 320GB w/ 8MB buffer

Table 4-1 Hardware Configuration

Since we want to compare the performance of CPU versus GPU, we list the specification of the GPU in detail as follows:

Code Name                  GeForce 8800 GTX (G80)
Number of SIMD Processors  16
Number of Registers        8192 (per SIMD processor)
Constant Cache             8KB (per SIMD processor)
Texture Cache              8KB (per SIMD processor)
Processor Clock Frequency  Shader: 1.35 GHz, Core: 575 MHz
Memory Clock Frequency     900 MHz
Shared Memory Size         16KB (per SIMD processor)
Device Memory Size         768MB GDDR3

Table 4-2 NVIDIA 8800GTX Hardware Specification

4.1.2. Software Configuration

OS                  Windows XP w/ Service Pack 2 (32-bit version)
GPU Driver Version  97.73
CUDA Version        0.81
Visual C++ Runtime  MS VC8 CRT ver. 8.0.50727.762

Table 4-3 Software Configuration

At the time of this writing, CUDA has been publicly available for only three months, and there are still many bugs in the toolkit and the runtime library. For example, as shown later, data transfer between host memory (i.e. CPU memory) and device memory (i.e. GPU memory) is rather slow due to the buggy runtime library. Also, even when an algorithm is carefully coded for the CUDA architecture, several compiler bugs lead to poor performance because of non-coalesced memory accesses. Fortunately, these bugs are promised to be fixed in the next release of CUDA.

For the client command processing simulation on the CPU, we use the STL to implement a grid-based world container: each bucket in the grid is a variable-length list that stores client objects. We choose to run the entire simulation in a single thread rather than multiple threads, to keep it simple and free of inter-thread communication overhead.
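A minimal sketch of such a container follows (the types and names are our own illustration, not the thesis code); the grid is a flat array of variable-length lists indexed by bucket:

    // Illustrative CPU-side grid world: a flat array of variable-length lists
    // indexed by bucket, searched sequentially for each range query.
    #include <list>
    #include <vector>

    struct Client { int id; float x, y; };

    class GridWorld {
    public:
        GridWorld(int rows, int cols, float bucket_edge)
            : rows_(rows), cols_(cols), edge_(bucket_edge),
              buckets_(rows * cols) {}

        void insert(const Client& c) {          // place a client into its bucket
            buckets_[bucket_index(c.x, c.y)].push_back(c);
        }

        // Visit every client in the 3x3 buckets around (x, y); this sequential
        // walk is how the CPU baseline answers a range query.
        template <typename Fn> void for_each_neighbor(float x, float y, Fn fn) {
            int bx = (int)(x / edge_), by = (int)(y / edge_);
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    int nx = bx + dx, ny = by + dy;
                    if (nx < 0 || nx >= cols_ || ny < 0 || ny >= rows_) continue;
                    for (const Client& c : buckets_[ny * cols_ + nx]) fn(c);
                }
        }

    private:
        int bucket_index(float x, float y) const {
            return (int)(y / edge_) * cols_ + (int)(x / edge_);
        }
        int rows_, cols_;
        float edge_;
        std::vector<std::list<Client>> buckets_;
    };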

4.2. Evaluation and Analysis

4.2.1. Comparison of CPU and GPU

We choose four different scenarios to examine the differences between CPU and GPU in terms of performance. The selected scenarios are listed in Table 4-4.

Map Size      2500x2500          5000x5000
AOI           10x10    20x20     10x10    20x20
Client Count  512 ~ 524288

Table 4-4 Tested Scenarios

All scenarios are evaluated, and the results are summarized in Table 4-5 and Table 4-6. The performance boost of GPU over CPU is calculated and depicted in Table 4-7 and Fig 4-3. Some test cases are marked as INVALID in the tables because they generate too many updates for the GPU to handle with its limited memory resources.

From Fig 4-3, we can see a performance improvement by a factor of 5.6 when the number of clients is 16384 in a virtual world of size 5000x5000 with AOI=20x20. However, this result is not as good as we expected, given that the GPU has 128 ALUs in total and its memory bandwidth is roughly 30 times that of the CPU.

From our measurements, when the number of clients is smaller than 4096, the CPU outperforms the GPU: the GPU is designed for large data sets and is not fully utilized at that scale. However, when the number of clients exceeds 131072 in the 2500x2500 map, the CPU outperforms the GPU again. We attribute the GPU's failure to deliver unparalleled performance to the limited bandwidth between CPU and GPU inherited from the buggy CUDA runtime.

Average Execution Time (ms)

Clients    MAP=2500x2500, AOI=10x10      MAP=2500x2500, AOI=20x20
           CPU          GPU              CPU          GPU
512        2.303        9.299            5.425        9.348
1024       4.624        10.529           10.869       10.602
2048       9.259        12.255           21.804       12.828
4096       18.711       14.296           43.871       16.082
8192       37.954       19.628           89.346       26.174
16384      77.139       34.020           184.965      59.245
32768      160.524      78.416           393.089      180.017
65536      346.258      226.567          872.226      655.178
131072     787.703      769.591          2020.357     2593.582
262144     1917.623     2903.820         5015.093     10695.969
524288     4982.445     11562.311        INVALID      INVALID

Table 4-5 Average Execution Time for map size 2500x2500

Fig 4-1 Average Execution Time for map size 2500x2500

Average Execution Time (ms)

Clients    MAP=5000x5000, AOI=10x10      MAP=5000x5000, AOI=20x20
           CPU          GPU              CPU          GPU
512        2.551        9.271            5.872        9.299
1024       5.094        10.450           11.717       10.503
2048       10.177       12.429           23.392       12.335
4096       20.408       14.028           47.006       14.405
8192       40.924       17.781           94.659       19.602
16384      82.299       26.537           190.806      33.853
32768      166.343      48.732           388.601      78.061
65536      339.565      112.016          804.078      226.177
131072     706.888      307.877          1707.657     766.990
262144     1520.068     963.912          3780.537     2897.588
524288     3421.073     3374.011         8741.187     11540.627

Table 4-6 Average Execution Time for map size 5000x5000

Fig 4-2 Average Execution Time for map size 5000x5000

Performance Improvement Ratio

Clients    2500x2500    2500x2500    5000x5000    5000x5000
           AOI=10x10    AOI=20x20    AOI=10x10    AOI=20x20
512        0.248        0.580        0.275        0.632
1024       0.439        1.025        0.487        1.116
2048       0.756        1.700        0.819        1.896
4096       1.309        2.728        1.455        3.263
8192       1.934        3.414        2.302        4.829
16384      2.267        3.122        3.101        5.636
32768      2.047        2.184        3.413        4.978
65536      1.528        1.331        3.031        3.555
131072     1.024        0.779        2.296        2.226
262144     0.660        0.469        1.577        1.305
524288     0.431        INVALID      1.014        0.757

Table 4-7 Performance Improvement Ratio of GPU over CPU

Fig 4-3 Performance Improvement Ratio of GPU over CPU

4.2.2. Detailed Performance of GPU

Recall that our GPU algorithm performs the server execution in four steps (a host-side sketch follows this list):

1. Upload data to GPU: the CPU collects client commands, compiles them into an array, and uploads it to the GPU via the PCI-Express bus.

2. Generate/sort client buckets: before processing client commands, client bucket indices are generated and all client objects are sorted by bucket index. This is the basis for the parallel range queries.

3. Process client commands and enumerate updates: run the game logic, count and store its effects, and generate a list of conflict-free update vectors. Based on the sorted client object list, we perform the parallel range queries, write the affected neighbor list, and finally update the virtual world.

4. Download the update vectors back to CPU: download all update vectors and the affected neighbor list from the GPU to the CPU.
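A minimal host-side sketch of this four-step loop follows; the types and the process_frame_on_gpu wrapper are hypothetical stand-ins for the kernels of Section 3.5:

    // Hypothetical host-side driver for one server tick (illustrative only).
    #include <cuda_runtime.h>

    struct Command     { int playerID; int action; };
    struct StateUpdate { int playerID; Command updateInfo; };

    // Stand-in for steps 2-3 (bucket sort, logic, range query); returns the
    // number of state updates written to dev_updates.
    int process_frame_on_gpu(Command* dev_cmds, int n_cmds,
                             StateUpdate* dev_updates, int max_updates);

    void server_tick(const Command* host_cmds, int n_cmds,
                     StateUpdate* host_updates, int max_updates,
                     Command* dev_cmds, StateUpdate* dev_updates)
    {
        // Step 1: upload the compiled client command array over PCI-Express.
        cudaMemcpy(dev_cmds, host_cmds, n_cmds * sizeof(Command),
                   cudaMemcpyHostToDevice);

        // Steps 2-3: all processing stays on the device.
        int n_updates = process_frame_on_gpu(dev_cmds, n_cmds,
                                             dev_updates, max_updates);

        // Step 4: synchronous read-back of the update vectors; with CUDA 0.81
        // this transfer dominated the measured execution time.
        cudaMemcpy(host_updates, dev_updates, n_updates * sizeof(StateUpdate),
                   cudaMemcpyDeviceToHost);
    }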

Among the four steps, the last is extremely time-consuming due to a well-known CUDA bug: memory transfer from GPU to CPU is slow, delivering only about 1/10 of the expected bandwidth. Also, from our experience with CUDA and the observed performance of our algorithms, a well-written CUDA program can outperform a poorly written one by a factor of 100; our load-balanced parallel radix sort, for example, is poorly implemented and therefore sorts very slowly.

Table 4-8 summarizes the time spent at each step of our GPU algorithm for the 2500x2500, AOI=10x10 scenario. The time to download the update vectors from GPU back to CPU takes more than 95% of the entire execution in the extreme case. Since the CPU and GPU are interconnected via the PCI-Express x16 bus, which theoretically delivers more than 4GB/s of bandwidth to main memory, this result is unreasonable and is generally regarded as a bug in the current CUDA release. Since the current CUDA release has no asynchronous read-back, we cannot work around the issue for now.

Detailed Execution Time of GPU Algorithm (ms)

Clients    Upload    Bucket Sorting    Logic Processing    Download
512        0.025     5.198             3.991               0.085
1024       0.033     5.291             5.028               0.177
2048       0.046     5.480             6.341               0.388
4096       0.070     5.933             7.193               1.099
8192       0.128     6.885             9.228               3.388
16384      0.235     8.770             12.801              12.215
32768      0.387     12.845            20.762              44.422
65536      0.688     21.738            37.891              166.249
131072     1.323     39.133            76.725              652.409
262144     2.556     72.936            168.346             2659.982
524288     5.053     137.026           413.857             11006.376

Table 4-8 Detailed Execution Time of GPU Algorithm at Each Step

4.2.3. Comparison of CPU and GPU with Computation Only

Since the data read-back performance is so poor, in this section we consider the GPU's computational performance only: we subtract from the total execution time the time to upload the client command data and the time to download the update vectors. Only bucket sorting and command logic processing are counted when comparing the computational power of the GPU against the CPU.

Average Execution Time (ms, GPU w/ Computation Only)

Clients    MAP=2500x2500, AOI=10x10      MAP=2500x2500, AOI=20x20
           CPU          GPU              CPU          GPU
512        2.303        9.189            5.425        9.209
1024       4.624        10.319           10.869       10.296
2048       9.259        11.821           21.804       11.980
4096       18.711       13.126           43.871       13.271
8192       37.954       16.112           89.346       16.100
16384      77.139       21.570           184.965      22.135
32768      160.524      33.607           393.089      34.549
65536      346.258      59.629           872.226      62.369
131072     787.703      115.858          2020.357     127.571
262144     1917.623     241.282          5015.093     326.096
524288     4982.445     550.882          INVALID      INVALID

Table 4-9 Average Time for map size 2500x2500 (GPU w/ Computation Only)

Fig 4-4 Average Execution Time for map size 2500x2500 (GPU w/ Computation Only)

Average Execution Time (ms, GPU w/ Computation Only)

Clients    MAP=5000x5000, AOI=10x10      MAP=5000x5000, AOI=20x20
           CPU          GPU              CPU          GPU
512        2.551        9.167            5.872        9.191
1024       5.094        10.285           11.717       9.191
2048       10.177       12.120           23.392       11.889
4096       20.408       13.310           47.006       13.234
8192       40.924       16.051           94.659       16.006
16384      82.299       21.584           190.806      21.628
32768      166.343      33.409           388.601      33.684
65536      339.565      58.235           804.078      59.650
131072     706.888      110.369          1707.657     114.769
262144     1520.068     223.584          3780.537     241.614
524288     3421.073     475.944          8741.187     550.768

Table 4-10 Average Time for map size 5000x5000 (GPU w/ Computation Only)

Fig 4-5 Average Execution Time for map size 5000x5000 (GPU w/ Computation Only)

Performance Improvement Ratio (GPU w/ Computation Only)

Clients    2500x2500    2500x2500    5000x5000    5000x5000
           AOI=10x10    AOI=20x20    AOI=10x10    AOI=20x20
512        0.577        1.357        0.278        0.639
1024       0.920        2.173        0.495        1.275
2048       1.460        3.369        0.840        1.968
4096       2.601        5.966        1.533        3.552
8192       4.113        9.692        2.550        5.914
16384      6.026        13.881       3.813        8.822
32768      7.731        18.103       4.979        11.537
65536      9.138        21.088       5.831        13.480
131072     10.267       22.041       6.405        14.879
262144     11.391       19.079       6.799        15.647
524288     12.039       INVALID      7.188        15.871

Table 4-11 Performance Improvement Ratio of GPU over CPU (GPU w/ Computation Only)

Fig 4-6 Performance Improvement Ratio of GPU over CPU (GPU w/ Computation Only)

4.2.4. Comparison of Different AOIs

Based on the different design methodologies of the virtual world representation, we observe some differences between the grid-based (CPU) approach and the GPU-based approach. For the grid-based approach, we simply build a large array whose elements are variable-length linked lists; client objects are stored in the lists and searched sequentially for each update. For the GPU-based approach, recall that we keep no grid in GPU memory; instead, we sort the client objects by their bucket indices and then perform an N-way binary search to find the affected neighbors.

The performance of the grid-based approach is therefore dominated by the average number of clients in the area of interest and by the size of the AOI: the larger the AOI, the more grid cells must be traversed. Changing the AOI, however, does not alter the behavior of the GPU-based approach, which gives the same performance as long as the average number of clients in the AOI remains the same.

From the figures above, a CPU performance loss is observed when the configuration changes from 2500x2500 with AOI=10x10 to 5000x5000 with AOI=20x20, while the GPU performance in the two configurations is almost identical.
