
4 Implementation and Optimization of 802.16a FEC Scheme on DSP Platform

4.3 Optimization on Reed-Solomon Code

4.3.2 Optimization on RS Decoder

4.3.2.4 Chien Search Improvement – II

After adding early termination to the Chien search, we find that even in the best case it still requires 57645 cycles to find the root (only one root!). We are confused at first, because if the error occurs in the first symbol of the codeword, the Chien search should find the root on the first attempt and should not spend so much execution time.

After we carefully examine our program code, we finally find the reason: the RS code we use is a shortened RS code. Take the (48,36,6) RS code for example. When doing the RS decoding, 203 zero bytes are inserted into the (48,36,6) codeword. That is to say, the first 203 symbols of the core (255,239,8) code will never be distorted by channel noise. But if we still use the Chien search module originally designed for the (255,239,8) code, it always starts searching for roots from the first symbol of the (255,239,8) code, which is never an error position. To remove the redundant calculations caused by this shortening, we modify the original Chien search module to start searching for roots from the first valid symbol, i.e., the 221st symbol for the (24,18,3) RS code, the 213th symbol for the (30,26,2) RS code, the 203rd symbol for the (48,36,6) RS code, and the 185th symbol for the (60,54,3) RS code. After the modification, the new profile of the Chien search module is shown in Table 4.9. As expected, the cycle count is greatly reduced compared with the original Chien search.

Type                        Cycles
Chien Search (Worst Case)   34558
Chien Search (Best Case)    1387
Average                     17972

Table 4.9: Profile of the Worst Case and Best Case of Early Terminated Chien Search (Modified).

4.3.2.5 Inverse-Free Berlekamp Massey Algorithm

The iterative algorithm discovered by Berlekamp and Massey (the BM algorithm) for decoding RS codes is a well-known technique for finding the errata locator polynomial.

Other decoding algorithms, such as the Euclidean and continued-fraction algorithms, have slightly higher complexity than the BM algorithm. However, the inversion of the discrepancy needed in the computation of the original BM algorithm is complex and time-consuming because it requires chained multiplications.

Fortunately, a new inverse-free BM algorithm is proposed in [18]. We do not describe the new algorithm in detail but simply explain its operation and then compare it with the original algorithm.

The inverse-free algorithm is formulated as follows (based on the modified inverse-free BM algorithm proposed in [19], which eliminates the pre-computation of the Forney syndromes and the post-computation of the errata locator polynomial required in [18]).

Compared with the original BM algorithm, it eliminates the inversion operation. To make sure this algorithm meets our needs, we perform an evaluation.

Table 4.10 shows the profile comparison between the original BM algorithm and the new inverse-free BM algorithm, where the notations A–E denote the five Galois field multipliers described previously. As expected, the inverse-free BM algorithm is faster than the original one, since it eliminates the inversion of the discrepancy. For the first three cases, due to the high computational complexity of Galois field multiplication, the inverse-free BM algorithm outperforms the original one significantly (around 5 times faster). In the last two cases, however, the advantage of the inverse-free algorithm decays greatly, because the inversion in those cases is based on table lookup, which is very simple and does not drastically affect the total execution time.

Type   Cycles (Original BM)   Cycles (Inverse-Free BM)   Improvement (%)
A      1185892                210616                     463
B      555215                 105753                     425
C      628873                 114714                     448
D      30036                  25175                      19
E      16163                  10913                      48

Table 4.10: Comparison between the Original and the Inverse-Free BM Algorithm.
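The table-lookup inversion used in cases D and E can be sketched as follows (a minimal illustration with our own names, assuming the usual log/antilog tables over GF(2^8)): since every nonzero element is a power of the primitive element α, the inverse is a single lookup, a⁻¹ = α^(255 − log_α a).

```c
#include <stdio.h>

unsigned char gf_alog[256], gf_log[256];   /* antilog and log tables for GF(2^8) */

void gf_tables_init(void) {
    int i, x = 1;
    for (i = 0; i < 255; i++) {            /* primitive polynomial 0x11d */
        gf_alog[i] = (unsigned char)x;
        gf_log[x]  = (unsigned char)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11d;
    }
}

unsigned char gf_mult(unsigned char a, unsigned char b) {
    if (!a || !b) return 0;
    return gf_alog[(gf_log[a] + gf_log[b]) % 255];
}

/* Inversion by one table lookup: a^-1 = alpha^(255 - log(a)), for a != 0.
 * This is why inversion barely affects the run time in cases D and E. */
unsigned char gf_inv(unsigned char a) {
    return gf_alog[(255 - gf_log[a]) % 255];
}
```

With this, the cost of each inversion is comparable to a single table-lookup multiplication, which matches the small gap observed between the two algorithms for cases D and E.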

4.3.2.6 Compiler Level Improvements

Referring to the profile data given in Table 4.6, our next target in optimizing the RS decoder is the syndrome calculator, because 39% of the execution time is spent on it. As described in Chapter 2, the syndrome calculation structure is not complex. The slow processing speed is caused by the massive substitutions of the field elements (α, α², …, α¹⁶) into the polynomial whose coefficients are the received data sequence. The compiler's feedback for the original syndrome calculator is shown in Fig. 4.14. Similar to the case of the encoder, we notice that the last line reports "Disqualified loop: Loop contains a call".

From the pseudo code of the syndrome calculator loop shown in Fig. 4.15, we know that this is due to the call to the Galois field multiplier gmpy( ). As described above, we again turn on the "File" level optimization to allow the compiler to deal with function calls inside a loop. Fig. 4.16 shows the compiler's feedback after we set the "Opt. Level" to "File". The compiler finally schedules the loop into a software-pipelined loop, which runs 2 iterations in parallel with each iteration completed in 13 cycles.

Figure 4.14: Compiler’s Feedback for Syndrome Calculator Loop.

Figure 4.15: Pseudo Code for Syndrome Calculator.

The second compiler-level improvement is on the Chien search function. Referring to Fig. 4.12 (b), we find that the trip count of the first inner loop depends on deg_lambda, which is calculated by the BM algorithm and thus cannot be determined at compile time. As a result, the compiler cannot decide how many times the loop will be executed, which limits its freedom in arranging resources. To remove this limit, we carefully examine the loop bounds.

;*---*

;* SOFTWARE PIPELINE INFORMATION

;* Disqualified loop: Loop contains a call

;*---*

for (i = 0; i < nn; i++) {
    for (j = 1; j <= no_p; j++) {
        product = gmpy(Alpha_to[B0-1+j], s[j]);
        s[j] = product ^ data[i];
    }
}

By using the pragma MUST_ITERATE (minimum trip count), we can pass the loop information to the compiler, making more information available at compile time, and thus the efficiency of the Chien search is improved.
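A hedged sketch of how such a pragma might be applied (the loop body here is our own stand-in, not the thesis code; TI's compiler accepts MUST_ITERATE(min, max, multiple), and other compilers simply ignore the unknown pragma):

```c
#include <stdio.h>

/* Stand-in inner loop: count the nonzero coefficients of lambda(x).
 * The pragma promises the compiler that deg_lambda is between 1 and 8
 * (at most 2t = 8 for the core (255,239,8) code), so the loop can be
 * software-pipelined without guarding against a zero trip count. */
int nonzero_coeffs(const unsigned char *lambda, int deg_lambda) {
    int j, count = 0;
    #pragma MUST_ITERATE(1, 8)
    for (j = 1; j <= deg_lambda; j++)
        if (lambda[j] != 0)
            count++;
    return count;
}
```

The pragma must immediately precede the loop it describes; the values given are a promise from the programmer, and wrong bounds lead to incorrect scheduling.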

Figure 4.16: Compiler’s Feedback for Syndrome Calculator Loop (After Build Option Change).

4.4 Optimization on Convolutional Code

Similar to the case of the RS code, we start the optimization from the convolutional encoder part, but we soon find that the original speed of the convolutional encoder is sufficient (around 11 Mbits/sec). The structure of the encoder is simple, so we skip optimizing the convolutional encoder and instead optimize the Viterbi decoder.

4.4.1 Optimization on Viterbi Decoder

4.4.1.1 Choose Appropriate Data Types for Branch Metric

As in the optimization of the RS codes, we first use the CCS built-in profiler to analyze the Viterbi decoder. Table 4.11 shows the original profile of the Viterbi decoder. This profile is obtained from the Viterbi decoder using floating-point values for the branch metrics, which is mainly for the purpose of soft-decision decoding.

Areas              Code Size   Cycles    Percentage (%)   Processing Rate (Kbits/sec)
Main Function      1536        8393747   100              27
Initialize_State   892         335       0
VD_Decode          960         8336355   100

Table 4.11: Original Profile of Viterbi Decoder.

However, the speed performance is awful if we use floating-point numbers for the branch metrics on the DSP, since our DSP is a fixed-point processor: it uses multiple fixed-point instructions to simulate each floating-point operation. So, as the first step of the Viterbi decoder optimization, we convert the floating-point values to fixed-point values. Of course, this conversion causes a loss of precision, but since what we really care about is the relative values of the branch metrics, not the absolute value of each branch metric, the conversion does not hurt the performance much. The conversion simply multiplies the original floating-point value by 1000 and rounds it to an integer. The profile of the Viterbi decoder after this conversion is shown in Table 4.12.

Areas              Code Size   Cycles   Percentage (%)   Processing Rate (Kbits/sec)
Main Function      932         616349   100              374
Initialize_State   892         196      0
VD_Decode          944         559879   100

Table 4.12: Profile of Viterbi Decoder Using Fixed Point Value As Branch Metric.

4.4.1.2 Modified Path Recording - I

Table 4.13 shows the profile of the VD_Decode function in Table 4.12; from it we know that the ACS (Add-Compare-Select) computation takes 79% of the total execution time. To increase the Viterbi decoder efficiency, we concentrate on the ACS computation. As described in Chapter 2, the ACS consists of 32 butterfly structures; each butterfly produces 2 state metrics, so in total we obtain the 64 state metrics used to expand the trellis. However, unlike the case of the RS decoder, a single ACS computation is simple and regular; the major reason for the slow operating speed is the massive number of computations, i.e., we have to do 64 ACS computations for only 1 output bit. Thus the key to accelerating the Viterbi decoder on the DSP platform is to make the program code match the features of the CCS compiler well. In other words, instead of trying to explore a new algorithm for convolutional decoding, we choose to refine the C code so that it is well software-pipelined by the CCS compiler, or

Areas                              Code Size   Cycles   Percentage (%)
ACS                                320         401665   79
Others (Metric Setup, Traceback)   712         105986   21

Table 4.13: Profile of VD_Decode Function.

if possible, we avoid using instructions which take a long time to complete, such as loads, stores, or branches.

First, referring to [7], we know that the register file is limited, so we are forced to store the state metrics in the internal data memory. However, for recording the optimal path we only need to write down "0" or "1" to represent the result in our path tracking procedure, since we exploit the butterfly structure in the ACS calculation. If we record the information at the bit level, we can store it in an internal register first and then put it into the internal data memory when the register is full. Hence, we can reduce the frequency of memory accesses. The revision is shown as the pseudo code in Fig. 4.17.

We see that the path is recorded via the pointer that points to the address in the internal

Figure 4.17: Pseudo Code for Recording Path in Internal Data Memory and Register.

data memory, as shown on the left side of Fig. 4.17, while the path is recorded in the internal register, as shown on the right side of Fig. 4.17. According to Table 4.1, the store instruction is in the three-cycle instruction category, so we can expect the speed to be improved by using the register to temporarily store the path information.

4.4.1.3 Modified Path Recording - II

With the modification of the previous section, we need to do a memory access once for every 8 bits of path information recorded. In order to improve the speed further, we want to reduce the memory access time as much as possible. In the previous section, we use a register declared as "unsigned char" to record the path information; the size of an "unsigned char" is 8 bits, so after every 8 bits of path information we must access memory, or the old path information will be lost due to overflow. In this section, we instead declare the register used to record the path information as "unsigned int", which is the largest data type that can store the bit information without extra instructions to load/store the register. (The size of "long" is larger than that of "int", but it requires extra instructions to load/store values from it.)

By using the "unsigned int" register, the frequency of memory accesses is reduced from once per 8 bits to once per 32 bits, and thus the speed is improved further. The profile data of these two optimizations are shown later in this chapter.
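The idea can be sketched as follows (a simplified stand-alone version with our own names; the real code interleaves this with the ACS butterflies):

```c
#include <stdint.h>

/* Pack one survivor-path decision bit per ACS result into a 32-bit register,
 * writing the register to memory only once every 32 decisions instead of
 * storing each decision individually. Assumes n is a multiple of 32. */
int pack_decisions(const unsigned char *decisions, int n, uint32_t *out) {
    uint32_t reg = 0;
    int i, words = 0;
    for (i = 0; i < n; i++) {
        reg = (reg << 1) | (decisions[i] & 1u);  /* shift decision into register */
        if ((i & 31) == 31) {                    /* register full: one store per 32 bits */
            out[words++] = reg;
            reg = 0;
        }
    }
    return words;                                /* number of 32-bit words written */
}
```

Using a 32-bit accumulator cuts the three-cycle store instructions by a factor of 32 relative to storing each decision bit separately.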

4.4.1.4 Counter Splitting

Referring to the CCS compiler's feedback shown in Fig. 4.18, we notice that the core loop of the ACS computation is not software-pipelined well, since it requires 17 cycles per iteration. This is too slow for the key loop of our Viterbi decoder. By carefully examining the feedback information, we find that the problem is the strong dependency between the instructions inside the loop, and that the dependency mainly comes from the counters i and i/2 used in the butterfly structure of the ACS computation. To break the dependency between these two counters, we exploit their relationship, shown in Fig. 4.19: each time i is increased by 2, i/2 is increased by 1. That is to say, we can declare a new counter j, initialized to zero and increased by 1 at the end of each iteration. This counter j is equivalent to i/2 but has no dependency on counter i, since we never declare j = i/2. The compiler's feedback after counter splitting is shown in Fig. 4.20; from these figures we know that the loop is software-pipelined better after we split the counter manually. Finally, the loop is software-pipelined with 3 iterations running in parallel, each completed in 7 cycles, while the original one is software-pipelined with only 2 iterations in parallel, each completed in 17 cycles.
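A minimal sketch of the transformation (our own illustrative loop, not the actual ACS code): instead of computing i/2 inside the loop, an independent counter j is incremented alongside i.

```c
/* Before the change, out[i/2] = ... creates a dependency chain through i/2.
 * After the change, a separate counter j walks in lockstep (j equals i/2 on
 * every iteration), so the compiler sees two independent induction variables
 * and can schedule the loop iterations more freely. */
void butterfly_indices(int nstates, int *out) {
    int i, j = 0;
    for (i = 0; i < nstates; i += 2) {
        out[j] = i;   /* j plays the role of i/2 without being computed from i */
        j++;
    }
}
```

The observable result is identical; only the data dependency the compiler must honor is removed.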

Figure 4.18: Compiler’s Feedback for ACS Loop.


Figure 4.20: Compiler’s Feedback for ACS Loop (After Counter Splitting).

4.4.1.5 Removal of Replicated Metrics

In this subsection, we remove the redundancy in the ACS loop to make it faster.

Refer to the pseudo code shown on the left side of Fig. 4.21, which shows the operation of updating the state metric values. This operation must be performed after finishing the 64 ACS computations, to replace the old state metrics with the newly computed ones. This code can be eliminated by manually unrolling the ACS loop into 2 sub-loops, as shown in the pseudo code on the right side of Fig. 4.21. Executing these two sub-loops in turn is equivalent to the operation of updating the state metrics. In addition, by manually unrolling the ACS loop, the instruction-level parallelism is also increased, and thus the compiler can pipeline the ACS loop even better than in the original program.
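The transformation can be sketched as a ping-pong buffer scheme (simplified stand-in arithmetic and our own names): the even sub-loop reads metrics from buffer A and writes to buffer B, the odd sub-loop does the reverse, so the copy-back loop disappears entirely.

```c
/* Two metric buffers alternate roles each trellis stage; the stand-in
 * "ACS" here just adds 1 so the effect is easy to check. After the loop,
 * the most recent metrics live in buf_a if stages is even, buf_b if odd. */
void run_stages(int *buf_a, int *buf_b, int nstates, int stages) {
    int r, s;
    for (r = 0; r < stages; r++) {
        int *src = (r & 1) ? buf_b : buf_a;   /* odd stage: read B  */
        int *dst = (r & 1) ? buf_a : buf_b;   /* odd stage: write A */
        for (s = 0; s < nstates; s++)
            dst[s] = src[s] + 1;              /* placeholder for add-compare-select */
    }
}
```

In the real decoder the two sub-loops are written out explicitly rather than selected by (r & 1), which is what exposes the extra instruction-level parallelism to the compiler.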


Figure 4.21: Pseudo Code For Removing Replicated Metrics.

4.5 Simulation Results

In this section, we present the simulation profiles generated by the CCS built-in profiler for the FEC scheme in IEEE 802.16a. The results of each optimization step described previously are also shown in these simulation profiles, so we can see how much improvement is obtained by each of our optimizations. At the end, the overall profiles of the FEC encoder and FEC decoder for the four required coding schemes defined in the IEEE 802.16a standard are shown, to evaluate the processing rate of our improved FEC programs.

4.5.1 Simulation Profile for RS Encoder

Table 4.14 shows the simulation profile of our RS encoder for encoding 36 bytes of data. The profile can be categorized into 5 areas. The first one is the simulation result for the original program. The second one is the result of the modification on data types. The third and fourth ones are the results of replacing the original Galois field multiplier with the alternative multipliers used to perform Galois field multiplication in the RS encoding procedure. The fifth one is the result of the modification on build options: we set it to the "File" level optimization, and we can also inline the Galois field multiplier function code inside the encoding function to reduce the huge overhead of calling it frequently. The label "I/O Included" in the table means that the execution time spent on I/O operations using fread( ) and fwrite( ) is included in the cycle count. If the streaming mechanism of the Quixote baseboard functions correctly, the execution time spent on I/O operations should differ from that using fread( ) and fwrite( ) (hopefully for the better). For reference, the profile which excludes the execution time spent on I/O operations is shown in

Optimization Step                     Code Size   Cycles    Processing Rate (Kbits/sec)   Improvement (%)
Original                              2928        1433434   120                           N/A
Data Type Modification                2284        1265024   136                           13
Mastrovito Multiplier                 3056        678174    254                           86
Serial Multiplier                     2100        766120    225                           65
Logarithmic Table Lookup Multiplier   5692        137394    1257                          824
Intrinsic Multiplier                  1464        77799     2221                          76
Compiler Level Optimization           1848        8036      21503                         868

Table 4.14: Profile of Reed-Solomon Encoder (I/O Included).

Table 4.15. From the simulation profiles, we know that the most effective modification to the RS encoder is the replacement of the original Galois field multiplier by the TI C64x intrinsic Galois field multiplier. But if we implement the RS encoder on another DSP board which does not have an intrinsic Galois field multiplier, then the most effective modification is the replacement of the original Galois field multiplier by the logarithmic table lookup multiplier. The improved processing rate of our RS encoder is about 21 Mbits/sec with I/O included, or 37 Mbits/sec with I/O excluded.

Optimization Step                     Code Size   Cycles    Processing Rate (Kbits/sec)   Improvement (%)
Original                              2928        1430005   120                           N/A
Data Type Modification                2284        1261595   136                           13
Mastrovito Multiplier                 3056        674745    256                           88
Serial Multiplier                     2100        763689    226                           66
Logarithmic Table Lookup Multiplier   5692        133954    1289                          847
Intrinsic Multiplier                  1464        74371     2323                          80
Compiler Level Optimization           1848        4607      37508                         1514

Table 4.15: Profile of Reed-Solomon Encoder (I/O Excluded).

4.5.2 Simulation Profile for RS Decoder

Tables 4.16 and 4.17 show the simulation profiles of our RS decoder with I/O included and with I/O excluded, respectively. The profiles are generated when decoding 48 bytes of received data. Besides the intrinsic multiplier, the second best improvement comes from the Chien search modification. In the last step, we make further improvements based on compiler-level tuning, which includes setting the "Opt. Level" in the CCS build options from "Function" to "File", using the pragma MUST_ITERATE to provide the minimum trip count information to the compiler, and inlining the frequently used functions.

Optimization Step Code Size Cycles Processing Rate (Kbits/sec)

Table 4.16: Profile of Reed-Solomon Decoder (I/O Included).

Optimization Step Code Size Cycles Processing Rate (Kbits/sec)

Table 4.17: Profile of Reed-Solomon Decoder (I/O Excluded).

4.5.3 Simulation Profile for CC Encoder

Optimization Step   Code Size   Cycles   Processing Rate (Kbits/sec)   Improvement (%)
Original            768         20596    11186                         N/A

Table 4.18: Profile of Convolutional Encoder (I/O Included).

Table 4.18 shows the simulation profile of our convolutional encoder for encoding 48 bytes of data; the original processing rate is about 11 Mbits/sec. Table 4.19 shows the simulation profile which excludes the execution time spent on I/O operations; the processing rate is about 15 Mbits/sec. This processing speed satisfies our requirements.

There is no need for further improvement currently.

Optimization Step   Code Size   Cycles   Processing Rate (Kbits/sec)   Improvement (%)
Original            768         15325    15034                         N/A

Table 4.19: Profile of Convolutional Encoder (I/O Excluded).

4.5.4 Simulation Profile for CC Decoder

Table 4.20 shows the simulation profile of our soft-decision Viterbi decoder for decoding 72 bytes of received data (4608 input bytes actually, since each received data bit is represented by two integer (32-bit) branch metrics for soft-decision decoding). Similar to the case of the RS code, the profile which excludes the execution time spent on I/O operations is shown in Table 4.21 for reference. From these two tables, we can see that the most significant improvement comes from the fixed-point modification, because our DSP is designed for fixed-point calculation.

Optimization Step   Code Size   Cycles    Processing Rate (Kbits/sec)   Improvement (%)
Original            3388        8393747   27                            N/A
Fixed Point         2856        616349    374                           1285
Path Recording I    2828        446506    516                           38
Path Recording II   3020        355547    648                           26

Table 4.20: Profile of Soft Decision Decoding Viterbi Decoder (I/O Included).

The second best improvement comes from the counter splitting, because it makes the software pipelining work much better on the key loop (the ACS loop) of our Viterbi decoder. Finally, we achieve a processing rate of 1101 Kbits/sec with I/O included, or 1514 Kbits/sec with I/O excluded.

Optimization Step           Code Size   Cycles    Processing Rate (Kbits/sec)   Improvement (%)
Original                    3388        8336690   28                            N/A
Fixed Point                 2856        560080    411                           1367
Path Recording I            2828        392896    586                           43
Path Recording II           3020        298492    772                           32
Counter Splitting           3284        179976    1280                          66
Removal of Metric Replica   3896        152206    1514                          18

Table 4.21: Profile of Soft Decision Decoding Viterbi Decoder (I/O Excluded).

4.5.5 Simulation Profile for FEC Encoder

In this subsection, we show the improved profile of our FEC encoder, which concatenates the RS encoder and the convolutional encoder. After encoding around 100 bytes of data (108 bytes for schemes 1, 3, and 4; 104 bytes for scheme 2, to meet the code specifications), Table 4.22 shows the average improved profile for the four required

Modulation   RS Code     CC Code Rate   Code Size   Cycles W/   Cycles W/O   Rate W/ (Kbits/sec)   Rate W/O (Kbits/sec)
QPSK         (24,18,3)   2/3            2384        86762       77938        5975                  6651
QPSK         (30,26,2)   5/6            2464        56312       48758        8865                  10238
16-QAM       (48,36,6)   2/3            2388        61677       54044        8405                  9592
16-QAM       (60,54,3)   5/6            2476        59638       52854        8692                  9808

Table 4.22: Profile of Forward Error Correction Encoder.

coding schemes defined in the IEEE 802.16a standard, where "W/" and "W/O" denote the profiles with and without the I/O execution time included, respectively.
