In the previous subsection, we have shown that running GGCE on μ < m equations can be useful. In this section, we will show that the well-known Gaussian elimination can make our solver even faster.
Note that in GGCE, there are some constant data, namely the Cβ1,...,βd’s (or δβ1,...,βd’s). According to Fig. 2.2(b), actions involving them are always in the form
δβ1,...,βd−1+ = Cβ1,...,βd.
xxvi
The key point is that if Cβ1,...,βd = 0, all such actions can be simply omitted.
Now suppose GGCE is used to solve the first μ equations f(0) to f(μ−1). The tar-get system can be treated as a matrix, where each row corresponds to one equation, and each column corresponds to one term containing the same monomial. Appar-ently, the solution space is invariant under elementary row operations. Thus, we may eliminate some C(i)β1,...,β
d’s for all 0 ≤ i ≤ μ − 1. In fact, we can always eliminate m − μ such coefficients. In Section 4.1, we will discuss the probability of accessing each differential, and apparently we should eliminate the most frequently used m−μ C(i)β
1,...,βd’s for all 0≤ i ≤ μ − 1 to achieve the greatest saving.
Chapter 4
Implementations
4.1 On NVIDIA GPUs, with CUDA
4.1.1 Overview
Our implementations allow parallel use of multiple GPU devices. Thus, at the beginning, the input system should be partially evaluated into intermediary systems, whose number equals to the number of devices. Then the same number of processes would be invoked such that each of them solves one of the intermediary systems by launching a kernel on one device.
Each of these processes would solve the first 32 equations with GGCE (enumer-ation phase), which would take place on the device. The solutions found by GGCE would be checked against the remaining equations using na¨ıve evaluation (check phase) on CPU. We pick the number 32 to match the register width on GPUs. In this way, each 32-bit differential can be store in a register, and accumulating can be done by performing bitwise XOR on them.
Before the enumeration phase starts, some preperation must be done first. Since a GPU kernel usually requires enough threads to hide instruction latency, the first 32 equations of each intermediary system has to be partially evaluated into a large number of small systems, each to be solve by one GPU thread with GGCE. Then, the non-common parts (Cβ1,...,βk’s where k < d) of the small systems would be sent to
xxviii
global memory, while the common parts (Cβ1,...,βd’s) are stored in constant memory.
After that, the enumeration phase can take the coefficients as input and run the instances of GGCE concurrently. The details and important issues in this phase will be introduced in the following subsections. By the end of this phase, the solution found by each thread would be stored in global memory, so they can be moved back to CPU for further processing.
The check phase is straightforward compared to the previous phase. Since only few (compared to 2n) solutions are expected to entering this phase, there are no fancy techniques involved. The only notable issue in this phase is that it actually handles some “mending” work, which will also be discussed in Section 4.1.5.
4.1.2 Register Usage
Because of the scarcity of fast memory on GPU, register usage is usually a critical issue for CUDA programmers. The problem, unfortunately, seems inevitable since the number of differentials grows rapidly as d increases. Actually, the problem can be even worse since NVIDIA’s nvcc compiler tends to allocate more registers than necessary. In fact, in our implementation for quadratic systems, everything fits in the registers after initialization. On the contrary, this is not the case in implementations for cubics and quartics.
In our implementation for quartics, each thread needs to maintain δi,j,k for 0≤ i < j < k < K, where K is the number of variables in the small systems. For K = 10, δi,j,k’s take 120 registers if we just store all of them in registers, making the number of active warps in each MP no more than 4, not to mention others differentials. One may argue that this problem can be solved by restricting the number K. However, this implies that an extremely deep partial evaluation has to be carried out, which can be both time and memory consuming.
Our strategy for register usage follows the principle of caching—storing the most frequently used things in registers. To be precise, each differential δ∗k is accessed with probability 2−(k+1) in each attempt. In other words, there exists a strong
bias in the probabilities of accessing each differential. So we can pick a suitable number γ and store all the δ∗k with k ≤ γ in registers and other differentials, global memory. The number of variables and actual number of registers allocated in our GPU implementations are shown in Table 4.1.
Table 4.1: Number of Registers Allocated in GPU Implementations d = 3 (γ = 9) d = 4 (γ = 7)
diffs other actual diffs other actual
46 11 64 64 15 80
4.1.3 Unrolling
While accumulating takes only few XORs (of differentials), indexing causes con-siderable overhead in each attempt. However, we find that the ubiquitous trick—
unrolling—can be quite helpful to alleviate the overhead.
Let us take a look at Fig. 4.1, which exemplifies our unrolling scheme for a GGCE solving quartics with unroll factor 8. The indices listed are all in the same unrolled block. It is shown that some entries, such as b2(∗ · · · ∗ 011), are constant. Other entries, although not fixed, can be determined once all bi(∗ · · · ∗ 000)’s are known.
Thus, the indexing is no longer needed in every attempt. Instead, it only needs to be invoked in attempts with index being an integral multiple of 8. This example also illustrates the cases for quadratics and cubics.
The reason for this is not really complex. Consider any unrolled block with 2u indices, where the first index is i. Any of other indices in this unrolled block can be defined as i = i + k, where k < 2u. Thus, the indices of the least sig-nificant HammingWeight(k) nonzero bits in i must be constant and can be de-termined before runtime. If we need any more indices bj(i) for i (which means HammingWeight(k) < d), it can always be computed by bj(i) = bj−h(i), where h = HammingWeight(k) and j > h.
xxx
Figure 4.1: An Example Unrolled Block with 8 Indices index b4 b3 b2 b1
∗ · · · ∗ 000 β4 β3 β2 β1
∗ · · · ∗ 001 β3 β2 β1 0
∗ · · · ∗ 010 β3 β2 β1 1
∗ · · · ∗ 011 β2 β1 1 0
∗ · · · ∗ 100 β3 β2 β1 2
∗ · · · ∗ 101 β2 β1 2 0
∗ · · · ∗ 110 β2 β1 2 1
∗ · · · ∗ 111 β1 2 1 0
4.1.4 Testing with Conditional Move
In Fig. 2.2(b), the testing simply adds the candidate vector into a set if its image is zero. However, this is infeasible in GPU implementations, for device memory is limited. Even if we assume the memory is large enough to store all candidate vectors, this can induce other problems such as synchronization between threads.
Actually, the testing in our GPU implementation is delicately designed to fit the hardware platform.
In the process of implementation, we discovered an undocumented feature of CUDA for G2xx series GPUs: nvcc reliably generates conditional (predicated) move instructions, which are dispatched with exceptional adeptness. According to our experiment results, we believed that conditional moves can be dispatched by SFUs (Special Function Units), so the they can be executed simultaneously with other instructions (such as XORs) handled by SPs (Streaming Processors).
In order to exploit the feature, the testing in GPU implementations is somewhat different with the pseudocode. Each thread maintains two registers, count and sol, to keep track of the solution count and the last solution found during runtime. Note that count does not actually record solution count accurately. Instead, it contains only three states: 0, 1, and 2+ (greater or equal to 2 solutions). While the two registers can be easily initialized, they should be correctly maintained after each unrolled block, which can be achieved by maintaining the same but local data for each unrolled block.
For each unrolled block, we use a tiny queue Q with capacity of merely two
elements to maintain the local data. Whenever a solution is found, the corresponding test vector (actually a part of it) is enqueued. In this way, we can tell whether the solution count is 0, 1, or 2+ by checking whether there are 0, 1, or 2 elements in Q at the end of each unrolled block. Moreover, the last solution (if any) must lies in the back of Q.
In order to show the power of this technique, some actual CUDA codes are presented in Table 4.2(b). After applying decuda to our program, we found that the repetitive four-line code segments correspond to at least four instructions including two XORs and two conditional moves. However, according to our experiment result, the four instructions average less than three SP cycles, which means executions of XORs and conditional moves are much overlapped.
Figure 4.2: CUDA and Cubin Code Fragments of Degree-2 GPU Implementation
...
xor.b32 $r19, $r19, c0[0x000c] // d_y^=d_yz xor.b32 $p1|$r20, $r17, $r20
mov.b32 $r3, $r1
mov.b32 $r1, s[$ofs1+0x0038]
xor.b32 $r4, $r4, c0[0x0010]
xor.b32 $p0|$r20, $r19, $r20 // res^=d_y
@$p1.eq mov.b32 $r3, $r1
@$p1.eq mov.b32 $r1, s[$ofs1+0x003c]
xor.b32 $r19, $r19, c0[0x0000]
xor.b32 $p1|$r20, $r4, $r20
@$p0.eq mov.b32 $r3, $r1 // cmov
@$p0.eq mov.b32 $r1, s[$ofs1+0x0040] // cmov ...
...
diff0 ^= deg2_block[ 3 ]; // d_y^=d_yz
res ^= diff0; // res^=d_y
(a) decuda Result from Cubin (b) CUDA Code for an Inner Loop Fragment
4.1.5 Re-enumeration
In the check phase, the solutions found in enumeration phase are examined. If a thread has found more than one solution, some mending work must be done on CPU or we may miss some actual solutions of the target system. A remedy for this is to repeat the work (GGCE) done by the thread again, which we call “re-enumeration.”
In fact, the re-enumeration can be aborted once the candidate solution meet the last solution returned by the thread. However, this is still a situation we would like to avoid as much as possible. Thus, our solution is to reduce the probability that any thread has found more than one solution by restricting the number of variables in
xxxii
the small systems. The smaller the solution space is, the little the probability would be. In fact, when n = 48, we restrict the number of variables (in small systems) to 26 to make re-enumerations take negligible time.