In the last few sections, we introduced various exhaustive-search solvers intended for solving a single system. However, as mentioned in Section 1, parallel computing is a more efficient way of taking advantage of Moore’s law, which calls for parallelization. In other words, we need to divide our problem into pieces so that the resulting subproblems can be solved concurrently.
An intuitive idea is partial evaluation. That is, we divide the target system into multiple subsystems by substituting all possible values for $s$ variables. In this way, there are $2^s$ subsystems, each with $n - s$ variables, and the number and size of the subsystems can be controlled simply by changing $s$. Moreover, we have found that partial evaluation can be made efficient by using GGCE as a subroutine.
Recall that we use $c_*$ for coefficients of subsystems and $C_*$ for those of the original system. Now consider a specific coefficient, say $c$, of one of the subsystems.
The coefficient is an image of a polynomial system h defined over the substituted variables. In fact, if we collect the same coefficient in all subsystems, the resulting set actually forms the range of h, which can be computed by GGCE (without testing) since it generates all images of a system. Thus, partial evaluation can be done by computing with GGCE all c’s, c0’s, c1’s, . . . , etc, until all d
We use an example to show this concept more clearly. Let us consider a case with $n = 4$, $d = 2$, and $s = 2$, where the variables to be substituted are $x_2$ and $x_3$. The original system can then be written as follows:
$$C_{0,1}x_0x_1 + (C_{0,2}x_2 + C_{0,3}x_3 + C_0)x_0 + (C_{1,2}x_2 + C_{1,3}x_3 + C_1)x_1 + (C_{2,3}x_2x_3 + C_2x_2 + C_3x_3 + C).$$
It can be inferred from the expression that the $c$’s form the range of $C_{2,3}x_2x_3 + C_2x_2 + C_3x_3 + C$, the $c_1$’s form the range of $C_{1,2}x_2 + C_{1,3}x_3 + C_1$, and so on.
Note that the degree-$d$ terms of all subsystems come directly from the original system. In other words, $c_{\alpha_1,\dots,\alpha_d} = C_{\alpha_1,\dots,\alpha_d}$ for any subsystem. Thus, we do not need to compute the $c_{\alpha_1,\dots,\alpha_d}$’s by GGCE; instead, we may simply copy those coefficients from the original system when we need them.
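The example above can be checked mechanically. The following is a minimal Python sketch and not the GGCE-based procedure itself: it substitutes values directly. The representation (a polynomial over F_2 as a set of monomials, each a frozenset of variable indices), the helper names `partial_evaluate` and `mono`, and the concrete coefficient values are assumptions chosen for illustration only.

```python
from itertools import product

def partial_evaluate(poly, assignment):
    """Substitute fixed bits into a polynomial over F_2.

    poly: set of monomials, each a frozenset of variable indices
          (frozenset() is the constant term 1).
    assignment: dict mapping variable index -> 0 or 1.
    Returns the polynomial in the remaining variables.
    """
    result = set()
    for m in poly:
        # A monomial vanishes if any substituted variable in it is 0.
        if any(assignment.get(v) == 0 for v in m):
            continue
        reduced = frozenset(v for v in m if v not in assignment)
        # Addition over F_2: equal monomials cancel in pairs.
        result ^= {reduced}
    return result

def mono(*indices):
    return frozenset(indices)

# Hypothetical instance with n = 4, d = 2:
# x0x1 + x0x2 + x0 + x1x3 + x2x3 + x3 + 1
poly = {mono(0, 1), mono(0, 2), mono(0), mono(1, 3),
        mono(2, 3), mono(3), mono()}

# Substitute all 2^s = 4 values for (x2, x3).
subsystems = {
    bits: partial_evaluate(poly, dict(zip((2, 3), bits)))
    for bits in product((0, 1), repeat=2)
}

for bits, sub in sorted(subsystems.items()):
    print(bits, sorted(sorted(m) for m in sub))
```

In this instance the degree-2 term x0x1 appears unchanged in every subsystem, while the constant terms collected over the four subsystems are exactly the values of x2x3 + x3 + 1 on the four assignments of (x2, x3).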
Time complexity. According to our arguments and example, it is clear that coefficients (in subsystems) of a degree-k monomial can be generated by running GGCE on a degree-(d − k) system defined over the s substituted variables. Thus, the bit operations required (per equation) for partial evaluation can be expressed by:
Memory Issues. Since in partial evaluation the $c$’s are generated by running GGCE on a degree-$d$ system with $s$ variables, at most $M_{\mathrm{GGCE(ro)}}(s, d) + M_{\mathrm{GGCE(rw)}}(s, d)$ bits per equation are required to complete the whole process. While this memory cost is usually affordable, the memory problems brought by partial evaluation are usually due to the subsystems it generates. To be exact, the subsystems take
$$2^s \sum_{k=0}^{d-1} \binom{n-s}{k} + \binom{n-s}{d}$$
bits per equation, where $\binom{n-s}{d}$ bits are for the $c_{\alpha_1,\dots,\alpha_d}$’s, which are shared by all subsystems. Thus, when $s$ is sufficiently large, the subsystems can take a huge amount of memory. However, sometimes we do not need all the subsystems at the same time, in which case there are at least two ways to mitigate the memory problem.
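To get a feeling for how quickly the subsystems grow with $s$, the count can be tabulated. This is a rough sketch under two assumptions: each coefficient takes one bit, and the degree-$d$ coefficients, being identical in every subsystem, are stored only once. The function name `subsystem_bits_per_equation` is hypothetical.

```python
from math import comb

def subsystem_bits_per_equation(n, d, s):
    """Bits per equation needed to hold all 2^s subsystems.

    Assumes one bit per coefficient and that the degree-d coefficients
    (copied from the original system) are stored a single time.
    """
    lower_degree = sum(comb(n - s, k) for k in range(d))  # degrees 0..d-1
    shared_top = comb(n - s, d)                           # degree d, stored once
    return (2 ** s) * lower_degree + shared_top

# Growth with s for a hypothetical setting n = 64, d = 2.
for s in (2, 10, 20, 30):
    print(s, subsystem_bits_per_equation(64, 2, s))
```

For the small example with $n = 4$, $d = 2$, $s = 2$, each of the four subsystems stores $c$, $c_0$, $c_1$ and all of them share $c_{0,1}$, giving 13 bits per equation.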
The first solution is to perform a multi-level partial evaluation, which is suitable when only $2^{s'}$ of all $2^s$ subsystems need to be dealt with at the same time. A two-level partial evaluation goes like this. First, we divide the target system into $2^{s-s'}$ “intermediary” systems. Then, we pick one of them at a time and divide it into $2^{s'}$ “final” systems, which are meant to be processed together. In this way, as long as the intermediary systems do not take much memory, we only need the memory for the $2^{s'}$ final systems.
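The two-level idea can be sketched as a generator that keeps only one batch of final systems alive at a time. This is an illustrative sketch, not the GGCE-based implementation: substitution is done directly, the intermediary systems are produced lazily rather than stored up front, and the names `substitute`, `two_level_batches`, `outer_vars`, and `inner_vars` as well as the sample polynomial are assumptions.

```python
from itertools import product

def substitute(poly, assignment):
    """Fix some variables of a polynomial over F_2.

    poly is a set of monomials, each a frozenset of variable indices.
    """
    result = set()
    for m in poly:
        if any(assignment.get(v) == 0 for v in m):
            continue  # the monomial vanishes
        # XOR-accumulate: equal monomials cancel over F_2.
        result ^= {frozenset(v for v in m if v not in assignment)}
    return result

def two_level_batches(poly, outer_vars, inner_vars):
    """Yield one batch of 2^len(inner_vars) final systems per intermediary system."""
    for outer_bits in product((0, 1), repeat=len(outer_vars)):
        intermediary = substitute(poly, dict(zip(outer_vars, outer_bits)))
        yield {
            outer_bits + inner_bits:
                substitute(intermediary, dict(zip(inner_vars, inner_bits)))
            for inner_bits in product((0, 1), repeat=len(inner_vars))
        }

# Hypothetical polynomial in x0..x4; x4 is the outer variable, (x2, x3) the inner ones.
f = frozenset
poly = {f((0, 1)), f((0, 2)), f((1, 4)), f((3, 4)), f((2,)), f()}

for batch in two_level_batches(poly, (4,), (2, 3)):
    print(len(batch), "final systems in this batch")
```

Because the batches are yielded one by one, a consumer that finishes with a batch before requesting the next never holds more than $2^{s'}$ final systems, here four.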
The second solution is to run all instances of GGCE (in partial evaluation) at the same time, which might be useful when the $2^s$ subsystems are dealt with one by one. Note that the attempts with the same index across all instances generate exactly the coefficients of one subsystem. Thus, by running all instances synchronously, we may generate the subsystems one by one. In fact, there is a one-to-one mapping between the terms in the original system and the terms in all coefficient-generating systems (the $h$’s); specifically, there is a one-to-one mapping between the highest-degree terms in the coefficient-generating systems and the degree-$d$ terms in the original system.
Thus, by the discussion of memory issues in GGCE, we may conclude that this scheme uses the same amount of read-only and read-write memory as running GGCE directly on the original system.
Chapter 3
Variants and Analysis
3.1 Early-abort Strategy
3.1.1 In Naïve Evaluation
While naïve evaluation has been shown to be outperformed by several solvers, it can be improved by an early-abort strategy. All we need to do is treat the equations as a sequence of candidate filters: each candidate vector in $\mathbb{F}_2^n$ is examined by the filters one by one until it is filtered out. Let $V^{(0)} = \mathbb{F}_2^n$ be the initial space of candidate vectors. Formally, for each $f^{(i)}$ we compute $f^{(i)}(V^{(i)})$ and arrive at $V^{(i+1)} = \{v \in V^{(i)} \mid f^{(i)}(v) = 0\}$.
Since on average each candidate vector is filtered out with probability 0.5, we only need to examine two filters on average. Consequently, the average number of bit operations required would be:
$$2 \cdot B_{\mathrm{Eval}}(n, d) = \sum_{i=0}^{d} \binom{n}{i} 2^{n-i+1}. \tag{3.1}$$
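The factor of two comes from a geometric distribution: if each filter rejects a candidate with probability 1/2, the expected number of filters examined is $\sum_{k \ge 1} k \cdot 2^{-k} = 2$. The following Python sketch illustrates the early-abort loop on a small random quadratic system. It is a toy illustration rather than an optimized solver, and the helper names (`random_quadratic`, `evaluate`, `early_abort_solve`) and the parameters are assumptions.

```python
import random
from itertools import combinations, product

def random_quadratic(n, rng):
    """Random polynomial of degree <= 2 over F_2, as a list of monomials.

    Each monomial is a tuple of variable indices; () is the constant term.
    """
    monomials = [()] + [(i,) for i in range(n)] + list(combinations(range(n), 2))
    return [m for m in monomials if rng.random() < 0.5]

def evaluate(poly, v):
    """Evaluate poly at the bit vector v (XOR of the monomial values)."""
    return sum(all(v[i] for i in m) for m in poly) % 2

def early_abort_solve(system, n):
    """Return (solutions, total number of filter evaluations)."""
    solutions, filters_used = [], 0
    for v in product((0, 1), repeat=n):
        for f in system:
            filters_used += 1
            if evaluate(f, v):      # f(v) = 1: v is filtered out
                break
        else:
            solutions.append(v)     # v passed every filter
    return solutions, filters_used

rng = random.Random(1)
n, m = 8, 8
system = [random_quadratic(n, rng) for _ in range(m)]
solutions, filters_used = early_abort_solve(system, n)
print("solutions:", len(solutions))
print("average filters per candidate:", filters_used / 2 ** n)
```

For random systems the printed average is typically close to 2, although it varies with the instance.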
3.1.2 In GGCE
The way we treat equations as filters is apparently not suitable for GGCE, for it needs to enumerate all of $\mathbb{F}_2^n$. However, can GGCE be modified so that it computes only $f^{(i)}(V^{(i)})$? One method that has occurred to us goes like this. To compute $f^{(i)}(V^{(i)})$, $f^{(i)}$ is first partially evaluated with some well-chosen $s^{(i)}$. Then, for each $v \in V^{(i)}$, $f^{(i)}(v)$ can be evaluated by substituting $n - s^{(i)}$ bits of $v$ into the corresponding subsystem.
Since the costs of partial evaluation and naïve evaluation are known, the number of bit operations required for computing $f^{(i)}(V^{(i)})$ can be expressed by:
$$B_{\mathrm{Partial}}(n, d, s^{(i)}) + 2^{n-i} \cdot B_{\mathrm{Eval\,Sum\,Avg}}(n - s^{(i)}, d),$$
where $2^{n-i}$ is the expected value of $|V^{(i)}|$. According to this expression, for fixed $n$, $d$, and $i$, the cost of computing $f^{(i)}(V^{(i)})$ depends only on $s^{(i)}$. Therefore, the total number of bit operations can be minimized simply by finding the best $s^{(i)}$ for each $i$ independently. Unfortunately, after trying mathematical techniques such as the first derivative test, we found it hard to express the best $s^{(i)}$ in a closed general form. Thus, we use an empirical approach instead, in which we search for the best $s^{(i)}$ in the interval $[0, n]$ for each $i$. The same procedure can be repeated for different settings of $(m, n, d)$ to gain enough generality. According to our experimental results, the sequence $[s^{(0)}, s^{(1)}, \dots, s^{(m-1)}]$ usually follows the pattern $[n, n, n - k_1, n - k_1 - 1, \dots, k_2, 0, \dots, 0]$, where $k_1$ and $k_2$ are small positive integers.
This implies that $[n, n, n - 1, n - 2, \dots, 1, 0, \dots, 0]$ might be a good choice in general.
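The empirical search amounts to an exhaustive minimization over $s^{(i)} \in [0, n]$ for each $i$. The sketch below shows its shape in Python. The two cost callables stand in for the partial-evaluation and per-vector evaluation costs; the toy models passed in at the bottom are placeholders invented for illustration and are not the cost formulas of this work. The function names `best_split` and `default_sequence` are likewise hypothetical.

```python
def best_split(n, d, i, cost_partial, cost_eval_avg):
    """Return the s in [0, n] that minimizes

        cost_partial(n, d, s) + 2^(n-i) * cost_eval_avg(n - s, d),

    where 2^(n-i) is the expected number of surviving candidates.
    Ties are resolved in favor of the smallest s.
    """
    def total(s):
        return cost_partial(n, d, s) + 2 ** (n - i) * cost_eval_avg(n - s, d)
    return min(range(n + 1), key=total)

def default_sequence(n, m):
    """The pattern [n, n, n-1, ..., 1, 0, ..., 0], cut or padded to length m."""
    return ([n] + [max(n - j, 0) for j in range(m - 1)])[:m]

# Placeholder cost models (assumptions, NOT the formulas from the text):
toy_partial = lambda n, d, s: 2 ** s      # grows with the number of subsystems
toy_eval_avg = lambda k, d: k + 1         # grows with the remaining variables

n, m, d = 4, 7, 2
searched = [best_split(n, d, i, toy_partial, toy_eval_avg) for i in range(m)]
print("searched:", searched)
print("default: ", default_sequence(n, m))
```

With real cost models plugged in, comparing the searched sequence against `default_sequence(n, m)` is one way to check how close the simple pattern is to the per-equation optimum.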
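Thus, with the $s^{(i)}$ chosen as above, we can approximate the total number of bit operations required with:
$$\sum_{i=0}^{m-1} \left[ B_{\mathrm{Partial}}(n, d, s^{(i)}) + 2^{n-i} \cdot B_{\mathrm{Eval\,Sum\,Avg}}(n - s^{(i)}, d) \right].$$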
We note that this upper bound may not be precise, since the cost of initialization in the GGCE instances (in partial evaluation) is ignored in part of the derivation. However, when $32 \le m = n \le 64$ and $2 \le d \le 4$, we find that this cost is indeed negligible in theory. Also recall that we assume $|V^{(i)}| = 2^{n-i}$, which might not be accurate in some cases.
In some sense, this scheme is a mix of GGCE (or partial evaluation) and naïve evaluation. By choosing $s^{(i)}$, we determine the weights of the two methods in the computation of $f^{(i)}(V^{(i)})$. That is, when $s^{(i)}$ is close to $n$, the scheme closely resembles GGCE; when $s^{(i)}$ is close to 0, it is more like naïve evaluation. This viewpoint is consistent with our experimental results: GGCE is more suitable for computing $f^{(i)}(V^{(i)})$ when $|V^{(i)}|$ is close to $2^n$ (or equivalently, when $i$ is small), and vice versa.
The importance of this scheme lies in its flexibility. It not only contains the solvers described in Sections 2.3, 2.5, and 3.1.1, but also allows a time-memory trade-off. Furthermore, it can be easily adapted to the hardware platform on which it is implemented.
Note that using a different $s^{(i)}$ for each equation might not be suitable for general-purpose hardware platforms such as GPUs and CPUs, for they cannot efficiently handle bit vectors of a wide variety of widths. For special devices such as FPGAs, however, the scheme might work well.