Gaussian Elimination - 解佈於二元體之多項式方程組之快速窮舉法

In the previous subsection, we have shown that running GGCE on μ < m equations can be useful. In this section, we will show that the well-known Gaussian elimination can make our solver even faster.

Note that in GGCE, there are some constant data, namely the C_β₁_,...,β_d’s (or δβ₁,...,β_d’s). According to Fig. 2.2(b), actions involving them are always in the form

δβ₁,...,β_d−1+ = C_β₁_,...,β_d.

xxvi

The key point is that if C_β₁_,...,β_d = 0, all such actions can be simply omitted.

Now suppose GGCE is used to solve the ﬁrst μ equations f⁽⁰⁾ to f^(μ−1). The tar-get system can be treated as a matrix, where each row corresponds to one equation, and each column corresponds to one term containing the same monomial. Appar-ently, the solution space is invariant under elementary row operations. Thus, we may eliminate some C⁽ⁱ⁾_β₁_,...,β

d’s for all 0 ≤ i ≤ μ − 1. In fact, we can always eliminate m − μ such coeﬃcients. In Section 4.1, we will discuss the probability of accessing each diﬀerential, and apparently we should eliminate the most frequently used m−μ C⁽ⁱ⁾_β

1,...,β_d’s for all 0≤ i ≤ μ − 1 to achieve the greatest saving.

Chapter 4 Implementations

4.1 On NVIDIA GPUs, with CUDA

4.1.1 Overview

Our implementations allow parallel use of multiple GPU devices. Thus, at the beginning, the input system should be partially evaluated into intermediary systems, whose number equals to the number of devices. Then the same number of processes would be invoked such that each of them solves one of the intermediary systems by launching a kernel on one device.

Each of these processes would solve the ﬁrst 32 equations with GGCE (enumer-ation phase), which would take place on the device. The solutions found by GGCE would be checked against the remaining equations using na¨ıve evaluation (check phase) on CPU. We pick the number 32 to match the register width on GPUs. In this way, each 32-bit diﬀerential can be store in a register, and accumulating can be done by performing bitwise XOR on them.

Before the enumeration phase starts, some preperation must be done ﬁrst. Since a GPU kernel usually requires enough threads to hide instruction latency, the ﬁrst 32 equations of each intermediary system has to be partially evaluated into a large number of small systems, each to be solve by one GPU thread with GGCE. Then, the non-common parts (C_β₁_,...,β_k’s where k < d) of the small systems would be sent to

xxviii

global memory, while the common parts (C_β₁_,...,β_d’s) are stored in constant memory.

After that, the enumeration phase can take the coeﬃcients as input and run the instances of GGCE concurrently. The details and important issues in this phase will be introduced in the following subsections. By the end of this phase, the solution found by each thread would be stored in global memory, so they can be moved back to CPU for further processing.

The check phase is straightforward compared to the previous phase. Since only few (compared to 2ⁿ) solutions are expected to entering this phase, there are no fancy techniques involved. The only notable issue in this phase is that it actually handles some “mending” work, which will also be discussed in Section 4.1.5.

4.1.2 Register Usage

Because of the scarcity of fast memory on GPU, register usage is usually a critical issue for CUDA programmers. The problem, unfortunately, seems inevitable since the number of diﬀerentials grows rapidly as d increases. Actually, the problem can be even worse since NVIDIA’s nvcc compiler tends to allocate more registers than necessary. In fact, in our implementation for quadratic systems, everything ﬁts in the registers after initialization. On the contrary, this is not the case in implementations for cubics and quartics.

In our implementation for quartics, each thread needs to maintain δi,j,k for 0≤ i < j < k < K, where K is the number of variables in the small systems. For K = 10, δi,j,k’s take 120 registers if we just store all of them in registers, making the number of active warps in each MP no more than 4, not to mention others diﬀerentials. One may argue that this problem can be solved by restricting the number K. However, this implies that an extremely deep partial evaluation has to be carried out, which can be both time and memory consuming.

Our strategy for register usage follows the principle of caching—storing the most frequently used things in registers. To be precise, each diﬀerential δ_∗k is accessed with probability 2^−(k+1) in each attempt. In other words, there exists a strong

bias in the probabilities of accessing each diﬀerential. So we can pick a suitable number γ and store all the δ_∗k with k ≤ γ in registers and other diﬀerentials, global memory. The number of variables and actual number of registers allocated in our GPU implementations are shown in Table 4.1.

Table 4.1: Number of Registers Allocated in GPU Implementations d = 3 (γ = 9) d = 4 (γ = 7)

diﬀs other actual diﬀs other actual

46 11 64 64 15 80

4.1.3 Unrolling

While accumulating takes only few XORs (of diﬀerentials), indexing causes con-siderable overhead in each attempt. However, we ﬁnd that the ubiquitous trick—

unrolling—can be quite helpful to alleviate the overhead.

Let us take a look at Fig. 4.1, which exempliﬁes our unrolling scheme for a GGCE solving quartics with unroll factor 8. The indices listed are all in the same unrolled block. It is shown that some entries, such as b2(∗ · · · ∗ 011), are constant. Other entries, although not ﬁxed, can be determined once all bi(∗ · · · ∗ 000)’s are known.

Thus, the indexing is no longer needed in every attempt. Instead, it only needs to be invoked in attempts with index being an integral multiple of 8. This example also illustrates the cases for quadratics and cubics.

The reason for this is not really complex. Consider any unrolled block with 2^u indices, where the ﬁrst index is i. Any of other indices in this unrolled block can be deﬁned as i = i + k, where k < 2^u. Thus, the indices of the least sig-niﬁcant HammingWeight(k) nonzero bits in i must be constant and can be de-termined before runtime. If we need any more indices bj(i) for i (which means HammingWeight(k) < d), it can always be computed by bj(i) = bj−h(i), where h = HammingWeight(k) and j > h.

xxx

Figure 4.1: An Example Unrolled Block with 8 Indices index b4 b3 b2 b1

∗ · · · ∗ 000 β4 β3 β2 β1

∗ · · · ∗ 001 β3 β2 β1 0

∗ · · · ∗ 010 β3 β2 β1 1

∗ · · · ∗ 011 β2 β1 1 0

∗ · · · ∗ 100 β3 β2 β1 2

∗ · · · ∗ 101 β2 β1 2 0

∗ · · · ∗ 110 β2 β1 2 1

∗ · · · ∗ 111 β1 2 1 0

4.1.4 Testing with Conditional Move

In Fig. 2.2(b), the testing simply adds the candidate vector into a set if its image is zero. However, this is infeasible in GPU implementations, for device memory is limited. Even if we assume the memory is large enough to store all candidate vectors, this can induce other problems such as synchronization between threads.

Actually, the testing in our GPU implementation is delicately designed to ﬁt the hardware platform.

In the process of implementation, we discovered an undocumented feature of CUDA for G2xx series GPUs: nvcc reliably generates conditional (predicated) move instructions, which are dispatched with exceptional adeptness. According to our experiment results, we believed that conditional moves can be dispatched by SFUs (Special Function Units), so the they can be executed simultaneously with other instructions (such as XORs) handled by SPs (Streaming Processors).

In order to exploit the feature, the testing in GPU implementations is somewhat diﬀerent with the pseudocode. Each thread maintains two registers, count and sol, to keep track of the solution count and the last solution found during runtime. Note that count does not actually record solution count accurately. Instead, it contains only three states: 0, 1, and 2+ (greater or equal to 2 solutions). While the two registers can be easily initialized, they should be correctly maintained after each unrolled block, which can be achieved by maintaining the same but local data for each unrolled block.

For each unrolled block, we use a tiny queue Q with capacity of merely two

elements to maintain the local data. Whenever a solution is found, the corresponding test vector (actually a part of it) is enqueued. In this way, we can tell whether the solution count is 0, 1, or 2+ by checking whether there are 0, 1, or 2 elements in Q at the end of each unrolled block. Moreover, the last solution (if any) must lies in the back of Q.

In order to show the power of this technique, some actual CUDA codes are presented in Table 4.2(b). After applying decuda to our program, we found that the repetitive four-line code segments correspond to at least four instructions including two XORs and two conditional moves. However, according to our experiment result, the four instructions average less than three SP cycles, which means executions of XORs and conditional moves are much overlapped.

Figure 4.2: CUDA and Cubin Code Fragments of Degree-2 GPU Implementation

...

xor.b32 $r19, $r19, c0[0x000c] // d_y^=d_yz xor.b32 $p1|$r20, $r17, $r20

mov.b32 $r3, $r1

mov.b32 $r1, s[$ofs1+0x0038]

xor.b32 $r4, $r4, c0[0x0010]

xor.b32 $p0|$r20, $r19, $r20 // res^=d_y

@$p1.eq mov.b32 $r3, $r1

@$p1.eq mov.b32 $r1, s[$ofs1+0x003c]

xor.b32 $r19, $r19, c0[0x0000]

xor.b32 $p1|$r20, $r4, $r20

@$p0.eq mov.b32 $r3, $r1 // cmov

@$p0.eq mov.b32 $r1, s[$ofs1+0x0040] // cmov ...

...

diff0 ^= deg2_block[ 3 ]; // d_y^=d_yz

res ^= diff0; // res^=d_y

(a) decuda Result from Cubin (b) CUDA Code for an Inner Loop Fragment

4.1.5 Re-enumeration

In the check phase, the solutions found in enumeration phase are examined. If a thread has found more than one solution, some mending work must be done on CPU or we may miss some actual solutions of the target system. A remedy for this is to repeat the work (GGCE) done by the thread again, which we call “re-enumeration.”

In fact, the re-enumeration can be aborted once the candidate solution meet the last solution returned by the thread. However, this is still a situation we would like to avoid as much as possible. Thus, our solution is to reduce the probability that any thread has found more than one solution by restricting the number of variables in

xxxii

the small systems. The smaller the solution space is, the little the probability would be. In fact, when n = 48, we restrict the number of variables (in small systems) to 26 to make re-enumerations take negligible time.

在文檔中解佈於二元體之多項式方程組之快速窮舉法 (頁 29-36)