Partial Evaluation - Fast Exhaustive Search for Solving Polynomial Systems over the Binary Field

In the last few sections, we introduced various exhaustive-search solvers intended for solving a single system. However, as mentioned in Section 1, parallel computing is a more efficient way of taking advantage of Moore’s law, which calls for parallelization. In other words, we need to divide our problem into pieces so that the resulting subproblems can be solved concurrently.

An intuitive idea is partial evaluation. That is, we divide the target system into multiple subsystems by substituting all possible values for s variables. In this way, there are 2^s subsystems, each with n − s variables, and the number and size of the subsystems can be controlled simply by changing s. Moreover, we have found that partial evaluation can be made efficient by using GGCE as a subroutine.

Recall that we use c_∗ for coefficients of subsystems and C_∗ for those of the original system. Now consider a specific coefficient, say c, of one of the subsystems. The coefficient is an image of a polynomial system h defined over the substituted variables. In fact, if we collect the same coefficient from all subsystems, the resulting set forms the range of h, which can be computed by GGCE (without testing) since GGCE generates all images of a system. Thus, partial evaluation can be done by using GGCE to compute all c’s, c_0’s, c_1’s, and so on, until the coefficients of all monomials of degree lower than d have been generated.

We use an example to show this concept more clearly. Let us consider a case with n = 4, d = 2, and s = 2, where the variables to be substituted are x_2 and x_3. Now, the original system can be written as follows:

$$C_{0,1} x_0 x_1 + (C_{0,2} x_2 + C_{0,3} x_3 + C_0)\, x_0 + (C_{1,2} x_2 + C_{1,3} x_3 + C_1)\, x_1 + (C_{2,3} x_2 x_3 + C_2 x_2 + C_3 x_3 + C).$$

It can be inferred from the expression that the c’s form the range of C_{2,3} x_2 x_3 + C_2 x_2 + C_3 x_3 + C, the c_1’s form the range of C_{1,2} x_2 + C_{1,3} x_3 + C_1, and so on.

Note that the degree-d terms of all subsystems actually come from the original system. In other words, c_{α_1,...,α_d} = C_{α_1,...,α_d} for any subsystem. Thus, we do not need to evaluate the c_{α_1,...,α_d}’s by GGCE. Instead, we may simply copy those coefficients from the original system when we need them.
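The example above can be checked with a small script. The following is a minimal Python sketch, not the GGCE-based procedure described in the text: it substitutes values by brute force, and the set-of-monomials representation, the sample coefficients, and the names `substitute` and `partial_evaluation` are assumptions made for illustration only.

```python
from itertools import product

# A polynomial over F_2 is a set of monomials; each monomial is a frozenset
# of variable indices (the empty frozenset is the constant term 1).

def substitute(poly, assignment):
    """Fix the variables in `assignment` (index -> bit); return the reduced polynomial."""
    reduced = set()
    for mono in poly:
        # The monomial survives only if every substituted variable in it is 1.
        if all(assignment[v] for v in mono if v in assignment):
            rest = frozenset(v for v in mono if v not in assignment)
            reduced ^= {rest}  # addition in F_2: equal monomials cancel
    return reduced

def partial_evaluation(poly, sub_vars):
    """Map each of the 2^s value tuples of `sub_vars` to its subsystem."""
    return {bits: substitute(poly, dict(zip(sub_vars, bits)))
            for bits in product((0, 1), repeat=len(sub_vars))}

# Hypothetical instance with n = 4, d = 2:
# x0x1 + x0x2 + x1x3 + x2x3 + x0 + x3 + 1
f = {frozenset(m) for m in [(0, 1), (0, 2), (1, 3), (2, 3), (0,), (3,), ()]}
subsystems = partial_evaluation(f, sub_vars=(2, 3))

for (x2, x3), sub in sorted(subsystems.items()):
    c = int(frozenset() in sub)   # constant coefficient of this subsystem
    h = (x2 & x3) ^ x3 ^ 1        # C_{2,3} x2 x3 + C_2 x2 + C_3 x3 + C, with C_2 = 0 here
    print((x2, x3), "c =", c, "h =", h, "has x0x1:", frozenset((0, 1)) in sub)
```

In this instance the constant coefficient of each subsystem agrees with the value of h at the substituted bits, and the degree-2 term x0x1 appears unchanged in every subsystem, which is the behaviour the text describes.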

Time complexity. According to our arguments and example, it is clear that the coefficients (in subsystems) of a degree-k monomial can be generated by running GGCE on a degree-(d − k) system defined over the s substituted variables. Thus, the bit operations required (per equation) for partial evaluation can be expressed by:
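As a hedged sketch of the count just described, and not a quotation of the original expression: assume B_GGCE(s, d′) denotes the per-equation bit operations of running GGCE, without testing, on a degree-d′ system over s variables (this symbol is introduced here only for illustration). A subsystem has \binom{n-s}{k} monomials of degree k in the n − s remaining variables, and the degree-d coefficients are copied rather than computed, so the cost would take the form:

```latex
B_{\mathrm{Partial}}(n, d, s) \;=\; \sum_{k=0}^{d-1} \binom{n-s}{k}\, B_{\mathrm{GGCE}}(s,\, d-k)
```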

Memory Issues. Since in partial evaluation the c’s are generated by running GGCE on a degree-d system with s variables, at most M_GGCE^(ro)(s, d) + M_GGCE^(rw)(s, d) bits per equation are required to complete the whole process. While this memory cost is usually affordable, the memory problems brought by partial evaluation are usually due to the subsystems it generates. To be exact, the subsystems take

$$2^s \sum_{k=0}^{d-1} \binom{n-s}{k} + \binom{n-s}{d}$$

bits per equation, where the \binom{n-s}{d} bits are for the c_{α_1,...,α_d}’s. Thus, when s is sufficiently large, the subsystems can take a huge amount of memory. However, sometimes we do not need all the subsystems at the same time, in which case there are at least two ways to mitigate the memory problem.
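To get a feel for how quickly this storage grows with s, the count can be tabulated. This is a minimal sketch under an explicit counting assumption: each coefficient takes one bit per equation, each of the 2^s subsystems stores its own coefficients of degree below d, and the degree-d coefficients are stored once because they are shared. `subsystem_bits` is a hypothetical helper name.

```python
from math import comb

def subsystem_bits(n, d, s):
    """Bits per equation needed to store all 2^s subsystems (assumed model)."""
    own = sum(comb(n - s, k) for k in range(d))  # coefficients of degree < d, per subsystem
    shared = comb(n - s, d)                      # degree-d coefficients, stored once
    return 2**s * own + shared

for s in (0, 8, 16, 24):
    print(s, subsystem_bits(64, 2, s))
```

For s = 0 the count reduces to the number of coefficients of the original equation, and for s = n it reduces to 2^n constants.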

The first solution is to perform a multi-level partial evaluation, which is suitable when only 2^{s′} of all 2^s subsystems need to be dealt with at the same time. A two-level partial evaluation goes like this. First, we divide the target system into 2^{s−s′} “intermediary” systems. Then, we pick one of them at a time and divide it into 2^{s′} “final” systems, which are meant to be processed together. In this way, as long as the intermediary systems do not take much memory, we only need the memory for the 2^{s′} final systems.
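The two-level idea can be sketched as a generator that keeps only one batch of final systems alive at a time. This is an illustrative Python sketch under assumptions: substitution is done by brute force rather than by GGCE, polynomials are sets of monomials (frozensets of variable indices), and `two_level`, `s_inner`, and the sample polynomial are hypothetical names and data.

```python
from itertools import product

def substitute(poly, assignment):
    """Fix the variables in `assignment` (index -> bit); return the reduced polynomial."""
    reduced = set()
    for mono in poly:
        if all(assignment[v] for v in mono if v in assignment):
            reduced ^= {frozenset(v for v in mono if v not in assignment)}
    return reduced

def two_level(poly, sub_vars, s_inner):
    """Yield one batch of 2^s_inner final systems per intermediary system."""
    split = len(sub_vars) - s_inner
    outer, inner = sub_vars[:split], sub_vars[split:]
    for outer_bits in product((0, 1), repeat=len(outer)):
        # Level 1: one of the 2^(s - s_inner) intermediary systems.
        intermediary = substitute(poly, dict(zip(outer, outer_bits)))
        # Level 2: the 2^s_inner final systems that are processed together.
        yield {outer_bits + inner_bits:
                   substitute(intermediary, dict(zip(inner, inner_bits)))
               for inner_bits in product((0, 1), repeat=len(inner))}

# Hypothetical quadratic polynomial in 5 variables; substitute x2, x3, x4.
f = {frozenset(m) for m in
     [(0, 1), (0, 2), (1, 3), (2, 4), (3, 4), (0,), (4,), ()]}
for batch in two_level(f, sub_vars=(2, 3, 4), s_inner=1):
    print(sorted(batch))  # the value tuples covered by this batch
```

Because each batch is built and then discarded before the next one is requested, the peak storage is one intermediary system plus 2^s_inner final systems.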

The second solution is to run all instances of GGCE (in partial evaluation) at the same time, which might be useful when the 2^s subsystems are dealt with one by one. Note that attempts with the same index of all instances actually generate all coefficients of the same subsystem. Thus, by running all instances synchronously, we may generate the subsystems one by one. In fact, there is a one-to-one mapping between the terms in the original system and the terms in all coefficient-generating systems (h’s); specifically, there is a one-to-one mapping between the highest-degree terms in the coefficient-generating systems and the degree-d terms in the original system.

Thus, by the discussion of memory issues in GGCE, we may conclude that this scheme uses the same amount of read-only and read-write memory as running GGCE directly on the original system.

Chapter 3 Variants and Analysis

3.1 Early-abort Strategy

3.1.1 In Naïve Evaluation

While the naïve evaluation has been shown to be outperformed by several solvers, it can actually be improved by taking advantage of an early-abort strategy. All we need to do is treat the equations as a sequence of candidate filters, so that each candidate vector in F_2^n is examined by the filters one by one until it is filtered out. Let V⁽⁰⁾ = F_2^n be the initial space of candidate vectors. Formally speaking, for each f⁽ⁱ⁾ we compute f⁽ⁱ⁾(V⁽ⁱ⁾) and arrive at V⁽ⁱ⁺¹⁾ = {v ∈ V⁽ⁱ⁾ | f⁽ⁱ⁾(v) = 0}.

Since on average each candidate vector is filtered out with probability 0.5, we only need to examine two filters on average. Consequently, the average number of bit operations required would be:

$$2 \cdot B_{\mathrm{Eval}}(n, d) = \sum_{i=0}^{d} \binom{n}{i}\, 2^{n-i+1}. \qquad (3.1)$$
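The filtering view can be sketched in a few lines of Python. This is an illustrative sketch under assumptions, not the implementation analysed above: polynomials are sets of monomials (frozensets of variable indices), the random test system and the names `random_poly`, `evaluate`, and `early_abort_search` are hypothetical, and the script counts filter evaluations rather than bit operations.

```python
import random
from itertools import combinations, product

def random_poly(n, d, rng):
    """Random polynomial of degree <= d over F_2 as a set of monomials."""
    return {frozenset(m) for k in range(d + 1)
            for m in combinations(range(n), k) if rng.random() < 0.5}

def evaluate(poly, v):
    """XOR of all monomials of `poly` evaluated at the bit vector v."""
    return sum(all(v[i] for i in mono) for mono in poly) % 2

def early_abort_search(system, n):
    """Return (solutions, total number of filter evaluations)."""
    solutions, evaluations = [], 0
    for v in product((0, 1), repeat=n):
        for f in system:
            evaluations += 1
            if evaluate(f, v):       # f(v) = 1: v is filtered out, stop early
                break
        else:
            solutions.append(v)      # v passed every filter
    return solutions, evaluations

rng = random.Random(1)
n = m = 10
system = [random_poly(n, 2, rng) for _ in range(m)]
solutions, evaluations = early_abort_search(system, n)
print("solutions:", solutions)
print("average filters per candidate:", evaluations / 2**n)
```

The printed average can be compared with the estimate of two filters per candidate; without the early abort, every candidate would cost m evaluations.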

3.1.2 In GGCE

The way we treat equations as filters is apparently not suitable for GGCE, for it needs to enumerate all of F_2^n. However, can GGCE be modified so that it computes only f⁽ⁱ⁾(V⁽ⁱ⁾)? One method we have come up with goes like this. To compute f⁽ⁱ⁾(V⁽ⁱ⁾), f⁽ⁱ⁾ is first partially evaluated with some well-chosen s⁽ⁱ⁾. Then, for each v ∈ V⁽ⁱ⁾, f⁽ⁱ⁾(v) can be evaluated by substituting n − s⁽ⁱ⁾ bits of v into the corresponding subsystem.
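One filtering step of this kind can be sketched as follows. It is a minimal Python sketch under assumptions: the substituted variables are taken to be the first s ones, the table of subsystems is built by brute-force substitution rather than by GGCE, and `filter_step` and the sample polynomial are hypothetical.

```python
from itertools import product

def substitute(poly, assignment):
    """Fix the variables in `assignment` (index -> bit); return the reduced polynomial."""
    reduced = set()
    for mono in poly:
        if all(assignment[v] for v in mono if v in assignment):
            reduced ^= {frozenset(v for v in mono if v not in assignment)}
    return reduced

def evaluate(poly, v):
    """XOR of all monomials of `poly` evaluated at the bit vector v."""
    return sum(all(v[i] for i in mono) for mono in poly) % 2

def filter_step(f, candidates, s):
    """Return [v in candidates : f(v) = 0] via partial evaluation on x_0..x_{s-1}."""
    # Partial evaluation: 2^s subsystems indexed by the substituted bits.
    table = {bits: substitute(f, dict(enumerate(bits)))
             for bits in product((0, 1), repeat=s)}
    # Each candidate selects its subsystem and substitutes its remaining n - s bits.
    return [v for v in candidates if evaluate(table[v[:s]], v) == 0]

# Hypothetical quadratic polynomial in n = 4 variables.
f = {frozenset(m) for m in [(0, 1), (0, 2), (1, 3), (2, 3), (0,), (3,), ()]}
candidates = list(product((0, 1), repeat=4))
for s in range(5):
    print(s, len(filter_step(f, candidates, s)))
```

With s = 0 the table holds f itself and the step is plain naïve evaluation; with s = n every subsystem is a constant and the step is a table lookup, so s sets the balance between precomputation and per-candidate work.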

Since the costs of partial evaluation and naïve evaluation are known, the number of bit operations required for computing f⁽ⁱ⁾(V⁽ⁱ⁾) can be expressed by:

$$B_{\mathrm{Partial}}(n, d, s^{(i)}) + 2^{n-i} \cdot B_{\mathrm{Eval\,Sum\,Avg}}(n - s^{(i)}, d),$$

where 2^{n−i} stands for the expected value of |V⁽ⁱ⁾|. According to this expression, for fixed n, d, and i, the cost of computing f⁽ⁱ⁾(V⁽ⁱ⁾) depends only on s⁽ⁱ⁾. Therefore, the total number of bit operations can be minimized simply by finding the best s⁽ⁱ⁾ for each i independently. Unfortunately, after trying mathematical techniques such as the first-derivative test, we found it hard to express the best s⁽ⁱ⁾ in a closed general form. Thus, we use an empirical approach instead, in which we search for the best s⁽ⁱ⁾ in the interval [0, n] for each i. The same procedure can be repeated for different settings of (m, n, d) to gain enough generality. According to our experimental results, the sequence [s⁽⁰⁾, s⁽¹⁾, . . . , s⁽ᵐ⁻¹⁾] usually follows the pattern [n, n, n − k_1, n − k_1 − 1, . . . , k_2, 0, . . . , 0], where k_1 and k_2 are small positive integers.

This suggests that [n, n, n − 1, n − 2, . . . , 1, 0, . . . , 0] might be a good choice in general.

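Thus, we can approximate the total number of bit operations required with:

$$\sum_{i=0}^{m-1} \left[ B_{\mathrm{Partial}}(n, d, s^{(i)}) + 2^{n-i} \cdot B_{\mathrm{Eval\,Sum\,Avg}}(n - s^{(i)}, d) \right],$$

with s⁽ⁱ⁾ taken from the sequence above.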

We note that this upper bound may not be precise, since the cost of initialization in the GGCE instances (in partial evaluation) is ignored in the approximation. However, when 32 ≤ m = n ≤ 64 and 2 ≤ d ≤ 4, we find that this cost is indeed negligible in theory. Also remember that we assume |V⁽ⁱ⁾| = 2^{n−i}, which might not be accurate in some cases.

In some sense, this scheme is a mix of GGCE (or partial evaluation) and naïve evaluation. By choosing s⁽ⁱ⁾, we determine the weights of the two methods in the computation of f⁽ⁱ⁾(V⁽ⁱ⁾). That is, when s⁽ⁱ⁾ is close to n, the scheme closely resembles GGCE. On the other hand, when s⁽ⁱ⁾ is close to 0, the scheme is more like the naïve evaluation. This viewpoint is consistent with our experimental results: GGCE is more suitable for computing f⁽ⁱ⁾(V⁽ⁱ⁾) when |V⁽ⁱ⁾| is close to 2ⁿ (or equivalently, when i is small), and naïve evaluation is more suitable otherwise.

The importance of this scheme lies in its flexibility. The scheme not only contains the solvers described in Sections 2.3, 2.5, and 3.1.1, but also allows a time-memory trade-off. Furthermore, it can easily be adapted to the hardware platform on which it is implemented.

Note that using a different s⁽ⁱ⁾ for each equation might not be suitable for general-purpose hardware platforms such as GPUs and CPUs, for they lack the capability to handle bit vectors of a wide variety of widths efficiently. For special devices such as FPGAs, however, the scheme might work well.
