Chapter 3 Building a Data Base on GPU Memory
3.3 Functions of Data Base on GPU
3.3.1 Data Structure
First, a conventional data base uses a tree structure to manage data or indices. A B-tree search, however, is poorly suited to a parallel computing architecture: running a sequential search on each core cannot fully exploit the computing capability of the GPU.
An access to device memory usually takes 200 to 300 clock cycles, which is slow relative to on-chip memory. Searching a B-tree causes many device memory accesses, which hurts GPU performance. In addition, a tree is normally built from linked-list nodes connected by pointers, so contiguous memory access is hard to achieve, and non-contiguous access reduces the opportunity for coalesced memory access.
To solve these problems, arrays are used as the data structure in GPU memory for storing tables and temporary data during computation. Searching an array is simpler than searching a tree: an array supports random access, and its elements can be compared in parallel, with each thread selecting its element by thread ID.
Figure 3 - 2 Data structure of data table.
Second, to reduce design complexity, a column is usually the unit of parallel computation in the data base. GPU memory is addressed linearly, so if tables were stored in row-major order, threads that process the same column of consecutive records would access addresses separated by a full record, which increases the chance of non-coalesced memory access. To avoid this problem, tables are stored in column-major order in GPU memory.
Figure 3 - 2 shows the relationship between records and memory addresses. With the data table stored in column-major order, an entire column can easily be compared against a condition. Threads perform the same operation on each element and access memory sequentially, which increases the opportunity for coalesced memory access.
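The column-major layout described above can be sketched as follows. This is a minimal CPU-side illustration in C++, not the implementation used in this work: the names ColumnMajorTable, at and flagGreater are hypothetical, and the loop over idx stands in for the GPU threads (one iteration per thread ID).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-major table: element (record r, column c) lives at
// data[c * numRecords + r], so one column occupies a contiguous range.
struct ColumnMajorTable {
    std::size_t numRecords;
    std::size_t numColumns;
    std::vector<int> data;  // size == numRecords * numColumns

    int at(std::size_t record, std::size_t column) const {
        assert(record < numRecords && column < numColumns);
        return data[column * numRecords + record];
    }
};

// Marks every record whose value in `column` exceeds `threshold`.
// Each loop iteration stands in for one GPU thread; consecutive iterations
// read consecutive addresses, which is the pattern coalescing relies on.
std::vector<int> flagGreater(const ColumnMajorTable& t, std::size_t column,
                             int threshold) {
    std::vector<int> flags(t.numRecords, 0);
    for (std::size_t idx = 0; idx < t.numRecords; ++idx) {
        flags[idx] = t.at(idx, column) > threshold ? 1 : 0;
    }
    return flags;
}
```

With a row-major layout the index would instead be record * numColumns + column, so neighbouring threads would read addresses numColumns elements apart.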
3.3.2 Selection Query
First, consider how a Selection Query works in principle. After the condition parameters are set, the process compares the columns of every record against them. If the columns of a record satisfy the condition, the record is selected and the corresponding position of the flag array is set to “1”. Finally, the flag array is checked and the marked records are copied to another table as the result. To parallelize the Selection Query, one thread is assigned to each record. Each thread compares the columns of its assigned record iteratively, while all threads process their records in parallel.
Actually, there are two difficulties to consider before starting the implementation:
1. In logical expressions, “AND” and “OR” have different priorities, and the priority determines which operator is executed first. In the common case, “AND” has higher priority than “OR”, so “AND” is executed first.
2. Branching within the same block can be expensive because threads execute on a SIMD processor, where only one instruction can be performed at a time (on multiple data sources). If threads take different execution paths, the paths must be serialized by the thread scheduler on the GPU (branch divergence).
To resolve these difficulties, the condition in ordinary infix notation is transformed into postfix notation before processing starts. Each record needs one stack, which its thread uses while the postfix expression is processed. In Figure 3 - 3, threads compare each record with the condition parameters set by the user and manipulate the stacks to determine which records are selected.
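The conversion to postfix notation can be illustrated with the standard shunting-yard method. The sketch below is an assumption about how such a conversion might look, not the code of this work: the function names precedence and toPostfix are hypothetical, and every token other than AND, OR and parentheses is treated as one opaque comparison operand.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Precedence used in this sketch: AND binds tighter than OR.
static int precedence(const std::string& op) {
    if (op == "AND") return 2;
    if (op == "OR") return 1;
    return 0;  // not a logical operator
}

// Converts an infix token list to postfix with the shunting-yard method.
// Any token that is not AND, OR, "(" or ")" is treated as an operand,
// i.e. one comparison such as "c0>5".
std::vector<std::string> toPostfix(const std::vector<std::string>& infix) {
    std::vector<std::string> output;
    std::vector<std::string> ops;  // operator stack
    for (const std::string& tok : infix) {
        if (tok == "(") {
            ops.push_back(tok);
        } else if (tok == ")") {
            while (!ops.empty() && ops.back() != "(") {
                output.push_back(ops.back());
                ops.pop_back();
            }
            assert(!ops.empty() && "unbalanced parentheses");
            ops.pop_back();  // discard "("
        } else if (precedence(tok) > 0) {
            // Pop operators of equal or higher precedence (left-associative).
            while (!ops.empty() && precedence(ops.back()) >= precedence(tok)) {
                output.push_back(ops.back());
                ops.pop_back();
            }
            ops.push_back(tok);
        } else {
            output.push_back(tok);  // operand
        }
    }
    while (!ops.empty()) {
        assert(ops.back() != "(" && "unbalanced parentheses");
        output.push_back(ops.back());
        ops.pop_back();
    }
    return output;
}
```

For the condition a OR b AND c this yields a, b, c, AND, OR, so the AND is applied before the OR when the postfix expression is evaluated with a stack.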
Figure 3 - 3 Process of Selection Query
Because branching within the same block can be expensive, the host computer is responsible for part of the flow control so that branch divergence is avoided.
When all threads have finished, the value at the bottom of each stack denotes whether the record was selected. The selected records are then copied to another 2-D array as the final result and transmitted to the host. The remaining question is how to determine the correct position of each selected record in that 2-D array.
This problem is easy to solve with cudppScan, a function provided by CUDPP. cudppScan performs a prefix sum operation on the flag array in GPU memory and outputs an array of the corresponding positions. In addition, the last value of the output gives the total number of selected records, which determines the size of the result table. The Selection Query we implemented is described in Figure 3 - 4, Figure 3 - 5, and Figure 3 - 6.
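To make the role of the prefix sum concrete, the following C++ sketch computes the positions sequentially on the CPU. It is a stand-in for cudppScan, not a call to it, and the names exclusiveScan and selectedCount are hypothetical. The sketch assumes an exclusive scan, which yields 0-based positions; with an inclusive scan the last output value is already the number of selected records.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum: positions[i] = flags[0] + ... + flags[i-1].
// For a selected record i, positions[i] is its 0-based row in the result.
std::vector<int> exclusiveScan(const std::vector<int>& flags) {
    std::vector<int> positions(flags.size(), 0);
    int running = 0;
    for (std::size_t i = 0; i < flags.size(); ++i) {
        positions[i] = running;
        running += flags[i];
    }
    return positions;
}

// Number of selected records. With an exclusive scan the last output does
// not yet include the last flag, so the two are added together.
int selectedCount(const std::vector<int>& flags,
                  const std::vector<int>& positions) {
    assert(flags.size() == positions.size());
    if (flags.empty()) return 0;
    return positions.back() + flags.back();
}
```

For the flag array 1, 0, 1, 1, 0, 1 the positions are 0, 1, 1, 2, 3, 3 and four records are selected.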
Algorithm SelectionQuery_SetSelectFlag(QueryTable, QueryData, selected_flag)
Input: QueryData (denotes which data are to be selected)
Output: selected_flag (an array of flags denoting which records are selected)
begin
    declare stack_d[][];
    declare top_d[];
    declare integer cnt;
    for i = 1 to 2*numberOfOperand-1 do
        switch (token of postfix) {
        case operand:
            switch (operator) {
            case “>”: call selectGreater_kernelProgram
            case “<”: call selectSmaller_kernelProgram
            case “=”: call selectEqual_kernelProgram
            :
            call selectOR_kernelProgram
        }
Figure 3 - 4 Algorithm of Selection Query
Algorithm SelectGreater_kernelProgram(dataTable, stack, top, selected_flag, column)
Input: dataTable (the address of the data table in GPU memory)
       column (index of column)
Output: selected_flag (an array of flags denoting which records are selected)
begin
    for idx = 1 to (the size of data table) do in parallel
        top[idx]++;
        if (dataTable[column][idx] > value) then stack[idx][top[idx]] = 1;
Figure 3 - 5 Algorithm of Greater Process in Selection Query
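The interplay of the host loop and the per-record stacks can be sketched in C++ as below. This is a sequential illustration under stated assumptions, not the kernels of this work: Kind, Token and evaluate are hypothetical names, the table is held as table[column][record], and each inner loop stands in for one kernel launch in which every iteration is one thread. The host chooses the operation per token, so all records follow the same path within a launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { Greater, Smaller, Equal, And, Or };

// One postfix token. `column` and `value` are only used by comparisons.
struct Token {
    Kind kind;
    std::size_t column;
    int value;
};

// Evaluates a postfix condition for every record of a column-major table
// (table[column][record]) and returns one 0/1 flag per record.
std::vector<int> evaluate(const std::vector<std::vector<int>>& table,
                          const std::vector<Token>& postfix) {
    const std::size_t n = table.empty() ? 0 : table[0].size();
    std::vector<std::vector<int>> stack(n, std::vector<int>(postfix.size(), 0));
    std::vector<int> top(n, -1);  // -1 marks an empty stack

    for (const Token& t : postfix) {  // host-side flow control
        if (t.kind == Kind::And || t.kind == Kind::Or) {
            for (std::size_t idx = 0; idx < n; ++idx) {  // one thread each
                assert(top[idx] >= 1);
                const int rhs = stack[idx][top[idx]];
                int& lhs = stack[idx][top[idx] - 1];
                lhs = (t.kind == Kind::And) ? (lhs & rhs) : (lhs | rhs);
                --top[idx];  // pop one operand; the result stays on top
            }
        } else {
            assert(t.column < table.size());
            for (std::size_t idx = 0; idx < n; ++idx) {  // one thread each
                const int v = table[t.column][idx];
                const bool hit = (t.kind == Kind::Greater) ? v > t.value
                               : (t.kind == Kind::Smaller) ? v < t.value
                                                           : v == t.value;
                stack[idx][++top[idx]] = hit ? 1 : 0;  // push the result
            }
        }
    }

    // The value left at the bottom of each stack is the record's flag.
    std::vector<int> flags(n, 0);
    for (std::size_t idx = 0; idx < n; ++idx) {
        flags[idx] = (top[idx] == 0) ? stack[idx][0] : 0;
    }
    return flags;
}
```

Unlike the pseudocode in the figures, the sketch uses 0-based indices and writes an explicit 0 when a comparison fails.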
Figure 3 - 7 Process of moving data queried to the result table.
Figure 3 - 7 shows the algorithm for moving records to the result table. Assume a table of 10 records, so 10 threads are assigned, one to each record. In Figure 3 - 7, t0~t9 denote the threads with thread ID [0] ~ thread ID [9]. The thread with thread ID [i] checks two values: one in the flag array of selected records and the other in the result array of the prefix sum function. If the value in the flag array is “1”, the value at the corresponding address of the prefix sum result array is the new position of the selected record in the result table. Finally, the selected records are copied to the result table.
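The move step can be sketched in C++ as a scatter driven by the two arrays. The function name compact is hypothetical, the table is assumed to be held as table[column][record], and positions is assumed to be an exclusive prefix sum of flags; the loop over i stands in for the threads t0~t9.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies the selected records of a column-major table (table[column][record])
// into a result table. `positions` is assumed to be an exclusive prefix sum
// of `flags`, so positions[i] is the destination row of record i whenever
// flags[i] == 1.
std::vector<std::vector<int>> compact(const std::vector<std::vector<int>>& table,
                                      const std::vector<int>& flags,
                                      const std::vector<int>& positions) {
    assert(flags.size() == positions.size());
    const std::size_t n = flags.size();
    const std::size_t count =
        n == 0 ? 0 : static_cast<std::size_t>(positions[n - 1] + flags[n - 1]);
    std::vector<std::vector<int>> result(table.size(),
                                         std::vector<int>(count, 0));
    for (std::size_t i = 0; i < n; ++i) {  // one iteration per thread t_i
        if (flags[i] == 1) {
            for (std::size_t c = 0; c < table.size(); ++c) {
                result[c][positions[i]] = table[c][i];
            }
        }
    }
    return result;
}
```

Because each selected record has a distinct position, no two threads write to the same row of the result table.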
Algorithm selectAND_kernelProgram(dataTable, stack, top, selected_flag)
Input: dataTable (the address of the data table in GPU memory)
Output: selected_flag (an array of flags denoting which records are selected)
begin
    for index = 1 to (the size of data table) do in parallel
        stack[index][top[index]-1] = stack[index][top[index]-1] & stack[index][top[index]];
        top[index]--;
end
Figure 3 - 6 Algorithm of AND Process in Selection Query