Galois Field Multiplication - Optimization on RS Encoder

4 Implementation and Optimization of 802.16a FEC Scheme on DSP Platform

4.3 Optimization on Reed-Solomon Code

4.3.1 Optimization on RS Encoder

4.3.1.2 Galois Field Multiplication

Table 4.3: Profile of Revised RS Encoder (Data Type Modification).

Figure 4.6: Pseudo Code for Variable Using Long and Int Data Type.


Table 4.3 shows the profile of our RS encoder program after the data type modification. As expected in Chapter 2, we observe that 96% of the execution time is spent on Galois field multiplication. There are already plenty of methods proposed [4], [5], [6] to accelerate Galois field multiplication for either hardware or software implementation. To find an appropriate method for our DSP implementation, we evaluate three of them: the Mastrovito multiplier method, the serial multiplier method, and the logarithmic table lookup method.

As proposed in [4], the Mastrovito algorithm performs multiplication in the ground field GF(2^m) (in our case, m = 8). Before going through the algorithm, we first introduce the polynomial notation for the Galois field multiplication equation

A(y)B(y) = C(y) mod Q(y).

Performing the polynomial multiplication and the modulo operation directly on the DSP, as the original version of the Galois field multiplier does, results in a very slow multiplier. Besides the 8-by-8-bit multiplication, the modulo operation requires many branch instructions, and, given the pipeline structure described above, each branch instruction takes 6 execution phases to reach its destination instruction. Hence, it is comparatively time consuming.
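To make the cost of the direct approach concrete, the following is a minimal C sketch of a multiply-then-reduce multiplier. It is not the original multiplier of this work, whose source is not reproduced here; the function name gf_mul_naive is hypothetical, and the field polynomial Q(y) = y^8 + y^4 + y^3 + y^2 + 1 (0x11D) is an assumption made for illustration.

```c
#include <stdint.h>

/* Assumed field polynomial Q(y) = y^8 + y^4 + y^3 + y^2 + 1. */
#define GF_POLY 0x11Du

/* Multiply first, reduce afterwards. The data-dependent branches in both
 * loops are the kind of code that pipelines poorly on a DSP. */
uint8_t gf_mul_naive(uint8_t a, uint8_t b)
{
    uint16_t prod = 0;
    int i;

    /* Carry-less 8-by-8-bit polynomial multiplication (up to 15 result bits). */
    for (i = 0; i < 8; i++) {
        if (b & (1u << i)) {
            prod ^= (uint16_t)(a << i);
        }
    }

    /* Reduction modulo Q(y): cancel bits 14 down to 8, one at a time. */
    for (i = 14; i >= 8; i--) {
        if (prod & (1u << i)) {
            prod ^= (uint16_t)(GF_POLY << (i - 8));
        }
    }
    return (uint8_t)prod;
}
```

With the assumed polynomial, gf_mul_naive(0x02, 0x80) returns 0x1D, since y · y^7 = y^8 reduces to y^4 + y^3 + y^2 + 1.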

Alternatively, Mastrovito has proposed an algorithm to speed up Galois field multiplication. First, the GF(2⁸) elements B(y) and C(y) are represented as column vectors B and C consisting of their binary polynomial coefficients. By introducing a “product matrix” Z = f(A(y),Q(y)), the Galois field multiplication can be described as

C = Z · B.

The coefficients f_{i,j} ∈ GF(2) of the product matrix depend recursively on the coefficients a_i of the polynomial A(y) and on the coefficients q_{i,j} of the matrix Q, which is derived from the binary field polynomial Q(y)

where the elements q_{i,j} of the matrix Q are defined by

= version multiplier, and hence should be faster than the original one.
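The matrix view of the multiplication can be illustrated with a short C sketch. It relies on the fact that column j of the product matrix holds the coefficients of A(y)·y^j mod Q(y), which follows from writing C(y) as the sum of b_j·A(y)·y^j. The sketch builds the columns iteratively rather than through the recursive coefficient formula of [4], so it should be read as an illustration of the idea only. All function names are hypothetical, and the field polynomial 0x11D is an assumption.

```c
#include <stdint.h>

/* Assumed field polynomial Q(y) = y^8 + y^4 + y^3 + y^2 + 1. */
#define GF_POLY 0x11Du

/* Build the product matrix for a fixed A(y). Column j is stored as one
 * byte whose bit i is the matrix entry in row i, column j. */
void gf_build_product_matrix(uint8_t a, uint8_t z_col[8])
{
    unsigned v = a;
    int j;

    for (j = 0; j < 8; j++) {
        z_col[j] = (uint8_t)v;                  /* A(y) * y^j mod Q(y)   */
        v <<= 1;                                /* multiply by y         */
        v ^= (v & 0x100u) ? GF_POLY : 0u;       /* reduce if degree is 8 */
    }
}

/* Matrix-vector product over GF(2): XOR the columns selected by the bits
 * of B. The mask avoids a data-dependent branch. */
uint8_t gf_mul_matrix(const uint8_t z_col[8], uint8_t b)
{
    uint8_t c = 0;
    int j;

    for (j = 0; j < 8; j++) {
        uint8_t mask = (uint8_t)(0u - ((b >> j) & 1u)); /* 0xFF if b_j = 1 */
        c ^= (uint8_t)(z_col[j] & mask);
    }
    return c;
}

/* Convenience wrapper: build the matrix for A, then apply it to B. */
uint8_t gf_mul_mastrovito(uint8_t a, uint8_t b)
{
    uint8_t z_col[8];

    gf_build_product_matrix(a, z_col);
    return gf_mul_matrix(z_col, b);
}
```

When one operand is reused for many products, the matrix can be built once and only gf_mul_matrix called per product; when both operands change every time, the matrix build dominates the cost.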

The second method, proposed in [5], is known as the serial multiplier. It was originally designed for implementations of public-key cryptosystems that require programmable multipliers in large Galois fields. The algorithm of this multiplier is also derived from the basic Galois field multiplication equation

A(y)B(y) = C(y) mod Q(y).

From this equation, we know that GF(2⁸) multiplication can be carried out by multiplying A(y) and B(y) and then performing the reduction modulo Q(y). An alternative way to do the same thing is to interleave multiplication and reduction according to the equation

C(y) = (...((a7 B(y)) y mod Q(y) + a6 B(y)) y mod Q(y) + ... + a1 B(y)) y mod Q(y) + a0 B(y).

In this equation, the value accumulated inside each level of parentheses is a partial result C'(i) generated at step i of the recursion, and a0 ~ a7 are the binary coefficients of A(y). The products y C'(i-1) are polynomials of degree up to 8, which must be reduced modulo Q(y).

These reductions are done using the following identity, where q0 ~ q7 denote the binary coefficients of Q(y) below degree 8:

y⁸ = q7 y⁷ + q6 y⁶ + ... + q1 y + q0 mod Q(y).

The interleaved algorithm requires one multiplication of B(y) by each binary coefficient of A(y) and 7 polynomial reductions. The GF(2⁸) serial multiplier, sometimes referred to as the “MSB-first multiplier,” is a polynomial basis multiplier that uses 8 slices and computes a GF(2⁸) product in 8 cycles. It is based on the algorithm described in the previous paragraph. To make it clearer, the algorithm is rewritten in Fig. 4.7, and its hardware realization is given in Fig. 4.8 for reference. This serial multiplier method is attractive for VLSI implementation, but it also has some advantages for DSP implementation because it provides a parallel architecture for the Galois field multiplier. It eliminates the complex branch instructions that the original Galois field multiplier requires for the reduction modulo Q(y). Hence, the CCS compiler can perform software pipelining more efficiently.

Figure 4.7: Algorithm for Serial Multiplier.
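As an illustration of the interleaved (MSB-first) scheme, the following C sketch processes the coefficients of A(y) from a7 down to a0 and replaces every conditional with a bit mask, so the loop body contains no data-dependent branch. It is a sketch under stated assumptions and not a transcription of Fig. 4.7: the function name gf_mul_serial is hypothetical, and the field polynomial y^8 + y^4 + y^3 + y^2 + 1 is assumed.

```c
#include <stdint.h>

/* Low byte of the assumed field polynomial Q(y) = y^8 + y^4 + y^3 + y^2 + 1,
 * i.e. the value that y^8 reduces to. */
#define GF_POLY_LOW 0x1Du

/* MSB-first serial multiplication: c = (c * y mod Q(y)) + a_i * B(y),
 * for i = 7 down to 0. */
uint8_t gf_mul_serial(uint8_t a, uint8_t b)
{
    uint8_t c = 0;
    int i;

    for (i = 7; i >= 0; i--) {
        /* 0xFF if c has a y^7 term, so that c * y needs a reduction. */
        uint8_t reduce = (uint8_t)(0u - (unsigned)(c >> 7));
        /* 0xFF if coefficient a_i is 1, so that B(y) is added in. */
        uint8_t add_b = (uint8_t)(0u - ((unsigned)(a >> i) & 1u));

        c = (uint8_t)((c << 1) ^ (reduce & GF_POLY_LOW) ^ (add_b & b));
    }
    return c;
}
```

The loop has a fixed trip count of 8 and a straight-line body, which is the shape a software-pipelining compiler handles well.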

Figure 4.8: Serial Multiplier in GF(2⁸).

The third method, proposed in [6], is the logarithmic table lookup method. It is a well-known method for computing GF(2ⁿ) arithmetic (multiplication, squaring, and inversion) for small values of n. In our case, the Galois field is GF(2⁸), so a primitive element g ∈ GF(2⁸) is selected to serve as the generator of the field. Thus, a nonzero element A(y) of this field is written as a power of g, A(y) = gⁱ, where 0 ≤ i ≤ 255. Then, we compute the powers of the primitive element, gⁱ for i = 0,1,…,255, and obtain 256 pairs of the form (A(y),i). Afterward, we construct two tables that sort these 256 pairs in two different ways: the log table is sorted with respect to A(y), and the alog table is sorted with respect to i. For example, for i = 26 and A(y) = g²⁶, we have log[A(y)] = 26 and alog[26] = A(y). These tables are stored in the DSP internal memory and are accessed when performing field multiplication, squaring, and inversion. Given two elements A(y), B(y) ∈ GF(2⁸), we perform the multiplication C(y) = A(y)B(y) mod Q(y) as follows

1. a := log[A(y)]

2. b := log[B(y)]

3. c := a + b (mod 255)

4. C(y) := alog[c]

This is due to the fact that C(y) = A(y) × B(y) = gⁱ g^j = g^(i+j mod 255). The ground field multiplication requires three memory accesses and a single addition modulo 255. The squaring of an element A(y) is slightly easier: only two memory access operations are required for computing C(y) = A(y)², as illustrated below

1. a := log[A(y)]

2. c := 2*a (mod 255)

3. C(y) := alog[c]

Similarly, the inversion of an element A(y) = gⁱ is computed using the property C(y) = A(y)⁻¹ = g⁻ⁱ = g^(255−i), which requires two memory access operations

1. a := log[A(y)]

2. c := 255 − a

3. C(y) := alog[c]

This method has the advantage of low computational complexity. It requires only an integer addition and a modulo 255 operation to perform Galois field multiplication, inversion, or squaring. Its cost is that it requires more memory accesses than the previous two methods, and its tables occupy more and more memory as n grows. Since n is only 8 in our application, memory space is not a serious problem here.
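The table construction and the three lookup procedures described above can be sketched in C as follows. This is a minimal sketch, not the implementation measured in this work: the function names are hypothetical, the generator g = y (0x02) and the field polynomial y^8 + y^4 + y^3 + y^2 + 1 are assumptions, and the tables are built lazily only to keep the example short. Zero has no logarithm, so the sketch handles it separately, a detail the step lists leave implicit.

```c
#include <assert.h>
#include <stdint.h>

#define GF_POLY 0x11Du  /* assumed Q(y) = y^8 + y^4 + y^3 + y^2 + 1 */

static uint8_t gf_log[256];   /* gf_log[A] = i with g^i = A; entry 0 unused */
static uint8_t gf_alog[256];  /* gf_alog[i] = g^i; entry 255 repeats g^0    */

/* Step through the powers of the assumed generator g = y (0x02). */
void gf_init_tables(void)
{
    unsigned v = 1;
    int i;

    for (i = 0; i < 255; i++) {
        gf_alog[i] = (uint8_t)v;
        gf_log[v] = (uint8_t)i;
        v <<= 1;
        if (v & 0x100u) {
            v ^= GF_POLY;
        }
    }
    gf_alog[255] = 1;  /* g^255 = g^0 = 1, used by the inversion below */
}

/* Build the tables on first use (gf_alog[0] is 1 once they exist). */
static void gf_ensure_tables(void)
{
    if (gf_alog[0] == 0) {
        gf_init_tables();
    }
}

/* C = alog[(log A + log B) mod 255]: three table accesses, one addition. */
uint8_t gf_mul_log(uint8_t a, uint8_t b)
{
    unsigned c;

    gf_ensure_tables();
    if (a == 0 || b == 0) {
        return 0;
    }
    c = (unsigned)gf_log[a] + gf_log[b];
    if (c >= 255) {
        c -= 255;  /* single conditional subtraction replaces "mod 255" */
    }
    return gf_alog[c];
}

/* C = alog[2 * log A mod 255]: two table accesses. */
uint8_t gf_square_log(uint8_t a)
{
    unsigned c;

    gf_ensure_tables();
    if (a == 0) {
        return 0;
    }
    c = 2u * gf_log[a];
    if (c >= 255) {
        c -= 255;
    }
    return gf_alog[c];
}

/* C = alog[255 - log A]: two table accesses; zero has no inverse. */
uint8_t gf_inverse_log(uint8_t a)
{
    gf_ensure_tables();
    assert(a != 0);
    return gf_alog[255 - gf_log[a]];
}
```

In a real encoder the tables would more likely be filled once at start-up, or stored as constant arrays, so that the per-multiplication path contains only the lookups and the addition.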

Besides the above three methods, we also looked for instructions or special architectural features for GF(2⁸) multiplication that fit the DSP hardware well.

Fortunately, we find that the C64x series DSP chips provide a special intrinsic function to perform GF(2⁸) multiplication (and GF(2⁸) only). The intrinsic function format is (unsigned int) _gmpy4(unsigned int A, unsigned int B). This function can perform four GF(2⁸) multiplications simultaneously, but beforehand we have to pack the four 8-bit Galois field elements into a 32-bit register, and the packing operation also consumes execution time. Overall, there is no benefit if we must pack the 8-bit elements into a 32-bit register before issuing a single _gmpy4 intrinsic instruction. Therefore, we decide to perform only one Galois field multiplication each time we call this intrinsic function, rather than four simultaneous multiplications.
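The packing overhead mentioned above can be made concrete with a portable C sketch. The function gmpy4_emulated below is a hypothetical software stand-in that mimics a four-lane byte-wise GF(2⁸) multiply; it is not the TI intrinsic, which is available only with the C64x compiler, and the field polynomial y^8 + y^4 + y^3 + y^2 + 1 is assumed here. The names pack4 and gf_mul8 are likewise illustrative.

```c
#include <stdint.h>

#define GF_POLY_LOW 0x1Du  /* low byte of the assumed Q(y) = 0x11D */

/* One branch-free GF(2^8) product (MSB-first serial form). */
uint8_t gf_mul8(uint8_t a, uint8_t b)
{
    uint8_t c = 0;
    int i;

    for (i = 7; i >= 0; i--) {
        uint8_t reduce = (uint8_t)(0u - (unsigned)(c >> 7));
        uint8_t add_b = (uint8_t)(0u - ((unsigned)(a >> i) & 1u));
        c = (uint8_t)((c << 1) ^ (reduce & GF_POLY_LOW) ^ (add_b & b));
    }
    return c;
}

/* Pack four 8-bit field elements into one 32-bit word (e0 in the low byte).
 * These shifts and ORs are the extra work needed before a four-lane call. */
uint32_t pack4(uint8_t e3, uint8_t e2, uint8_t e1, uint8_t e0)
{
    return ((uint32_t)e3 << 24) | ((uint32_t)e2 << 16) |
           ((uint32_t)e1 << 8) | (uint32_t)e0;
}

/* Hypothetical stand-in for a four-lane multiply: byte k of the result is
 * the GF(2^8) product of byte k of a and byte k of b. */
uint32_t gmpy4_emulated(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    int lane;

    for (lane = 0; lane < 4; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));
        uint8_t y = (uint8_t)(b >> (8 * lane));
        r |= (uint32_t)gf_mul8(x, y) << (8 * lane);
    }
    return r;
}
```

A single multiplication, as chosen in this work, corresponds to passing each operand in the low byte with the upper three lanes left at zero, for example gmpy4_emulated(0x02, 0x80), so no packing step is needed.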

In Table 4.4, multiplier A refers to the original multiplier; multiplier B refers to the Mastrovito multiplier; multiplier C refers to the serial multiplier; multiplier D refers to the logarithmic table lookup multiplier; and multiplier E refers to the intrinsic Galois field multiplier provided by the TI C64x series DSP chips. The notation “one mult” in the cycles column denotes the cycle count for performing one Galois field multiplication. From this table, we observe that multiplier B has the largest code size, which we attribute to the build-up of the “product matrix”. In terms of total memory, however, the most space-consuming multiplier is multiplier D, because it has to store two tables of 256 elements each. The code size in parentheses (4204) denotes the total size of the two tables plus the multiplier. The most efficient multiplier is multiplier E. The C64x series DSP chips may contain application-specific hardware for computing Galois field multiplication. This conjecture is based on the evidence that the CCS compiler translates the _gmpy4 intrinsic function directly into an assembly instruction named GMPY4. Since there is an assembly instruction dedicated to Galois field multiplication, there is likely specific hardware to perform this task.

To make our program platform independent, we seek an appropriate algorithm among B, C, and D. As one may expect, none of these three multipliers outperforms TI's intrinsic multiplier, since the latter is accelerated by TI's hardware. We find, however, that the performance of the logarithmic table lookup multiplier is still quite good even compared with the intrinsic one. This suggests that a software-oriented algorithm is more appropriate for DSP implementation than a hardware-oriented algorithm. However, if we implemented the hardware-oriented algorithm on the built-in FPGA of the Quixote DSP baseboard, we might reach a totally different conclusion. The profile of simulation results for each revision step will be shown in section 4.5.

Multiplier Type     Code Size     Cycles (One Mult)
GF_Multiplier A     584           292
GF_Multiplier B     1080          167
GF_Multiplier C     456           189
GF_Multiplier D     88 (4204)     22
GF_Multiplier E     12            6

Table 4.4: Comparison of the Five Different Galois Field Multipliers.
