
4 Implementation and Optimization of 802.16a FEC Scheme on DSP Platform

4.4 Optimization on Convolutional Code

4.4.1 Optimization on Viterbi Decoder


Function           Code Size   Cycles   Percentage (%)   Processing Rate (Kbits/sec)
Main Function      932         616349   100              374
Initialize_State   892         196      0
VD_Decode          944         559879   100

Table 4.12: Profile of Viterbi Decoder Using Fixed Point Value As Branch Metric.

4.4.1.2 Modified Path Recording - I

Table 4.13 shows the profile of the VD_Decode function listed in Table 4.12, from which we know that the ACS (Add-Compare-Select) computation takes 79% of the total execution time. To increase the Viterbi decoder's efficiency, we therefore concentrate on the ACS computation. As described in Chapter 2, the ACS consists of 32 butterfly structures; each butterfly produces 2 state metrics, so in total we obtain the 64 state metrics used to expand the trellis. However, unlike the case of the RS decoder, a single ACS computation is simple and regular; the major reason for the slow operating speed is the massive amount of computation, i.e., we have to do 64 ACS computations for only 1 output bit. Thus the key to accelerating the Viterbi decoder on the DSP platform is to make the program code match the features of the CCS compiler well. In other words, instead of trying to find a new algorithm for convolutional decoding, we choose to refine the C code so that it is well software-pipelined by the CCS compiler, or

Areas                              Code Size   Cycles   Percentage (%)
ACS                                320         401665   79
Others (Metric Setup, Traceback)   712         105986   21

Table 4.13: Profile of VD_Decode Function.

if possible, to avoid using instructions that take a long time to complete, such as loads, stores, and branches.
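As a concrete illustration of the butterfly computation discussed above, one ACS step can be sketched in C as follows. This is a simplified stand-in, not the thesis code; the type and function names are ours:

```c
/* One ACS (Add-Compare-Select) step for a single successor state.
   Two predecessor state metrics and two branch metrics are added,
   compared, and the smaller sum is selected; the selection also
   yields the one-bit path decision used later for traceback. */
typedef struct {
    int metric;    /* surviving (smaller) path metric */
    int path_bit;  /* 0: upper predecessor won, 1: lower predecessor won */
} acs_out;

acs_out acs(int sm_upper, int sm_lower, int bm_upper, int bm_lower) {
    int m_upper = sm_upper + bm_upper;   /* add */
    int m_lower = sm_lower + bm_lower;
    acs_out r;
    if (m_upper <= m_lower) {            /* compare-select */
        r.metric = m_upper;
        r.path_bit = 0;
    } else {
        r.metric = m_lower;
        r.path_bit = 1;
    }
    return r;
}
```

Each of the 32 butterflies performs two such ACS steps per trellis stage, which yields the 64 new state metrics.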

First, referring to [7], we know that the register file is limited, so we are forced to store the state metrics in the internal data memory. However, for recording the optimal path, we only need to write down a "0" or "1" to represent the result in our path tracking procedure, since we exploit the butterfly structure in the ACS calculation. If we record this information at the bit level, we can store it in an internal register first and then move it to the internal data memory when the register is full. Hence, we can reduce the frequency of memory accesses. The revision is shown as the pseudo code in Fig. 4.17.

Figure 4.17: Pseudo Code for Recording Path in Internal Data Memory and Register.

The path is recorded through a pointer into the internal data memory as shown on the left side of Fig. 4.17, while the path is recorded in an internal register as shown on the right side. According to Table 4.1, the store instruction is in the three-cycle instruction category, so we can expect the speed to be improved by using a register to temporarily store the path information.
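Since Fig. 4.17 gives only pseudo code, the two recording schemes might be sketched in C roughly as follows; the buffer and variable names are hypothetical:

```c
#include <stdint.h>

/* Left side of Fig. 4.17 (sketch): one store to internal data memory
   per recorded path bit; each store is a three-cycle instruction. */
void record_bit_memory(uint8_t *path_mem, int idx, int decision) {
    path_mem[idx] = (uint8_t)(decision & 1);
}

/* Right side (sketch): accumulate bits in a register and store to
   internal data memory only once the register is full (every 8 bits). */
void record_bit_register(uint8_t *path_mem, int *idx,
                         uint8_t *reg, int *nbits, int decision) {
    *reg = (uint8_t)((*reg << 1) | (decision & 1));
    if (++*nbits == 8) {          /* register full: flush to memory */
        path_mem[(*idx)++] = *reg;
        *reg = 0;
        *nbits = 0;
    }
}
```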


4.4.1.3 Modified Path Recording - II

In the previous subsection, we used a register declared as "unsigned char" to record the path information; since the size of an "unsigned char" is 8 bits, we need to access memory once for every 8 recorded path bits, or the old path information would be lost due to overflow. In order to improve the speed further, we want to reduce the number of memory accesses as much as possible. In this section, we therefore declare the register used to record the path information as "unsigned int", which is the largest data type that can hold the bit information without extra instructions to load/store it. (The size of "long" is larger than that of "int", but it requires extra instructions to load/store its value.)

By using the "unsigned int" register, the frequency of memory access is reduced from once per 8 bits to once per 32 bits, and thus the speed is improved further. The profile data of these two optimizations are shown later in this chapter.
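Under the same assumptions as before (names are hypothetical), widening the packing register to "unsigned int" changes only the flush interval:

```c
#include <stdint.h>

/* Sketch: pack path bits into an "unsigned int" register so that
   internal data memory is written only once per 32 bits. */
typedef struct {
    uint32_t reg;    /* packing register */
    int      nbits;  /* bits accumulated so far */
} bit_packer;

void record_bit32(uint32_t *path_mem, int *idx, bit_packer *p, int decision) {
    p->reg = (p->reg << 1) | (uint32_t)(decision & 1);
    if (++p->nbits == 32) {       /* flush once every 32 bits */
        path_mem[(*idx)++] = p->reg;
        p->reg = 0;
        p->nbits = 0;
    }
}
```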

4.4.1.4 Counter Splitting

Referring to the CCS compiler's feedback shown in Fig. 4.18, we notice that the core loop of the ACS computation is not software-pipelined well, since it requires 17 cycles per iteration. This is too slow for the key loop of our Viterbi decoder. By carefully examining the feedback information, we find that the problem is due to the strong dependency between the instructions inside the loop, and the dependency mainly comes from the counters i and i/2 used in the butterfly structure for the ACS computation. To break the dependency between these two counters, we exploit their relationship, shown in Fig. 4.19: each time i is increased by 2, i/2 is increased by 1. That is to say, we can declare a new counter named j, which is initialized to zero and increased by 1 at the end of each iteration. This counter j is equivalent to i/2 but has no dependency on the counter i, since we never declare j = i/2. The compiler's feedback after counter splitting is shown in Fig. 4.20; from these figures we know that the loop is software-pipelined better after we split the counter manually. Finally, the loop is software-pipelined such that 3 iterations run in parallel and each is completed in 7 cycles, while the original one is only software-pipelined such that 2 iterations run in parallel and each is completed in 17 cycles.
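The counter-splitting idea can be illustrated with a simplified loop. The indexing here is a stand-in for the real butterfly addressing, not the actual decoder code:

```c
/* Before: the loop body indexes with both i and i/2, and the compiler
   sees a dependency between the two index expressions. */
void acs_loop_before(int *new_metric, const int *old_metric, int n) {
    for (int i = 0; i < n; i += 2) {
        new_metric[i]     = old_metric[i / 2];          /* i/2 derived from i */
        new_metric[i + 1] = old_metric[i / 2 + n / 2];
    }
}

/* After: a separate counter j replaces i/2.  It starts at zero and is
   incremented once per iteration, so the dependency on i disappears. */
void acs_loop_after(int *new_metric, const int *old_metric, int n) {
    int j = 0;                                          /* split counter, j == i/2 */
    for (int i = 0; i < n; i += 2) {
        new_metric[i]     = old_metric[j];
        new_metric[i + 1] = old_metric[j + n / 2];
        j++;                                            /* advanced independently */
    }
}
```

Both versions compute the same result; only the dependency structure visible to the compiler changes.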

Figure 4.18: Compiler’s Feedback for ACS Loop.


Figure 4.20: Compiler’s Feedback for ACS Loop (After Counter Splitting).

4.4.1.5 Removal of Replicated Metrics

In this subsection, we remove a redundancy in the ACS loop to make it faster.

Refer to the pseudo code shown on the left side of Fig. 4.21, which shows the operation of updating the state metric values. This operation must be performed after finishing the 64 ACS computations, so as to replace the old state metrics with the newly computed ones. This code can be eliminated by manually unrolling the ACS loop into 2 sub-loops, as shown in the pseudo code on the right side of Fig. 4.21. Executing these two sub-loops in turn is equivalent to the operation of updating the state metrics. In addition, by manually unrolling the ACS loop, the instruction-level parallelism is also increased, and thus the compiler can pipeline the ACS loop even better than in the original program.


Figure 4.21: Pseudo Code For Removing Replicated Metrics.
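The effect of the unrolling in Fig. 4.21 can be sketched as follows, with a dummy "+1" standing in for the real ACS computation (names are illustrative):

```c
#include <string.h>

#define NSTATES 64

/* Before (left side of Fig. 4.21, sketched): compute the new metrics,
   then copy them back over the old ones after all 64 ACS operations. */
void stage_with_copy(int *state_metric, int *scratch) {
    for (int s = 0; s < NSTATES; s++)
        scratch[s] = state_metric[s] + 1;   /* stand-in for ACS */
    memcpy(state_metric, scratch, sizeof(int) * NSTATES);  /* the replica */
}

/* After (right side, sketched): two sub-loops used alternately, reading
   from one buffer and writing to the other, so the copy-back vanishes. */
void stage_pingpong(const int *src, int *dst) {
    for (int s = 0; s < NSTATES; s++)
        dst[s] = src[s] + 1;                /* stand-in for ACS */
}
```

Even time stages call stage_pingpong(A, B) and odd ones stage_pingpong(B, A), which is equivalent to two applications of stage_with_copy.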

4.5 Simulation Results

In this section, we present some simulation profiles generated by the CCS built-in profiler for the FEC scheme in IEEE 802.16a. The results of each optimization step described above are also shown in these simulation profiles, so we can see how much improvement is obtained by each of our optimizations. At the end, the overall profiles of the FEC encoder and FEC decoder for the four required coding schemes defined in the IEEE 802.16a standard are also shown, in order to evaluate the processing rate of our improved FEC programs.

4.5.1 Simulation Profile for RS Encoder

Table 4.14 shows the simulation profile of our RS encoder when encoding 36 bytes of data. The profile can be categorized into 5 areas. The first one is the simulation result for the original program. The second one is the result of the modification on data type.


The third and fourth ones are the results of using different methods to perform Galois field multiplication in the RS encoding procedure. The fifth one is the result of the modification on the build option: we set it to the "File" level optimization, and we can also inline the Galois field multiplier function code inside the encoding function to reduce the large overhead of calling it frequently. The label "I/O Included" in the table means that the execution time spent on I/O operations using fread() and fwrite() is included in the cycle count. If the streaming mechanism of the Quixote baseboard functioned correctly, the execution time spent on I/O operations would differ from that using fread() and fwrite() (hopefully for the better). For reference, the profile which excludes the execution time spent on I/O operations is shown in

Optimization Step                     Code Size   Cycles    Processing Rate (Kbits/sec)   Improvement (%)
Original                              2928        1433434   120                           N/A
Data Type Modification                2284        1265024   136                           13
Mastrovito Multiplier                 3056        678174    254                           86
Serial Multiplier                     2100        766120    225                           65
Logarithmic Table Lookup Multiplier   5692        137394    1257                          824
Intrinsic Multiplier                  1464        77799     2221                          76
Compiler Level Optimization           1848        8036      21503                         868

Table 4.14: Profile of Reed-Solomon Encoder (I/O Included).

Table 4.15. From the simulation profile, we know that the most efficient modification to the RS encoder is the replacement of the original Galois field multiplier by the TI C64x intrinsic Galois field multiplier. But if we implement the RS encoder on another DSP board which does not have an intrinsic Galois field multiplier, then the most efficient modification is the replacement of the original Galois field multiplier by the logarithmic table lookup multiplier. The improved processing rate of our RS encoder is about 21 Mbits/sec with I/O included or 37 Mbits/sec with I/O excluded.

Optimization Step                     Code Size   Cycles    Processing Rate (Kbits/sec)   Improvement (%)
Original                              2928        1430005   120                           N/A
Data Type Modification                2284        1261595   136                           13
Mastrovito Multiplier                 3056        674745    256                           88
Serial Multiplier                     2100        763689    226                           66
Logarithmic Table Lookup Multiplier   5692        133954    1289                          847
Intrinsic Multiplier                  1464        74371     2323                          80
Compiler Level Optimization           1848        4607      37508                         1514

Table 4.15: Profile of Reed-Solomon Encoder (I/O Excluded).

4.5.2 Simulation Profile for RS Decoder

Tables 4.16 and 4.17 show the simulation profiles of our RS decoder with I/O included and excluded, respectively. These profiles are generated when decoding 48 bytes of received data. Besides the intrinsic multiplier, the second best improvement comes from the Chien search modification. In the last step, we make further improvements based on compiler-level tuning: setting the "Opt. Level" in the CCS build options from "Function" to "File", using the pragma MUST_ITERATE to give the compiler minimum trip count information, and inlining the frequently used functions.

Optimization Step Code Size Cycles Processing Rate (Kbits/sec)

Table 4.16: Profile of Reed-Solomon Decoder (I/O Included).

Optimization Step Code Size Cycles Processing Rate (Kbits/sec)

Table 4.17: Profile of Reed-Solomon Decoder (I/O Excluded).
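The compiler-level tuning described above might look roughly like this in C. The pragma syntax is the TI CCS one (guarded so other compilers skip it), and the function names are hypothetical:

```c
/* Hypothetical sketch of the compiler-level tuning: "inline" removes
   the call overhead of a frequently used helper, and MUST_ITERATE
   tells the TI compiler the loop runs at least 8 times in multiples
   of 8, which helps it software-pipeline the loop. */
static inline unsigned char gf_add(unsigned char a, unsigned char b) {
    return (unsigned char)(a ^ b);   /* GF(2^8) addition is just XOR */
}

void gf_add_block(unsigned char *dst, const unsigned char *src, int n) {
#ifdef __TI_COMPILER_VERSION__
#pragma MUST_ITERATE(8, , 8)
#endif
    for (int i = 0; i < n; i++)
        dst[i] = gf_add(dst[i], src[i]);
}
```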

4.5.3 Simulation Profile for CC Encoder

Optimization Step   Code Size   Cycles   Processing Rate (Kbits/sec)   Improvement (%)
Original            768         20596    11186                         N/A

Table 4.18: Profile of Convolutional Encoder (I/O Included).

Table 4.18 shows the simulation profile of our convolutional encoder when encoding 48 bytes of data; the original processing rate is about 11 Mbits/sec. Table 4.19 shows the simulation profile which excludes the execution time spent on I/O operations; there the processing rate is about 15 Mbits/sec. This processing speed already satisfies our requirements, so there is no need for further improvement currently.

Optimization Step   Code Size   Cycles   Processing Rate (Kbits/sec)   Improvement (%)
Original            768         15325    15034                         N/A

Table 4.19: Profile of Convolutional Encoder (I/O Excluded).

4.5.4 Simulation Profile for CC Decoder

Table 4.20 shows the simulation profile of our soft decision decoding Viterbi decoder when decoding 72 bytes of received data (actually 4608 input bytes, since each received data bit is represented by two integer (32-bit) branch metrics for soft decision decoding). Similar to the case of the RS code, the profile which excludes the execution time spent on I/O operations is shown in Table 4.21 for reference. From these two tables, we can see that the most significant improvement comes from the fixed point modification, because our DSP is designed for fixed-point calculation.
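The input-size arithmetic above can be double-checked with a one-line helper (the function name is ours):

```c
/* Sanity check of the figure above: 72 received bytes, with each data
   bit expanded to two 32-bit (4-byte) soft-decision branch metrics. */
int soft_input_bytes(int data_bytes) {
    return data_bytes * 8   /* bits per byte            */
                      * 2   /* branch metrics per bit   */
                      * 4;  /* bytes per 32-bit metric  */
}
```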

Optimization Step   Code Size   Cycles    Processing Rate (Kbits/sec)   Improvement (%)
Original            3388        8393747   27                            N/A
Fixed Point         2856        616349    374                           1285
Path Recording I    2828        446506    516                           38
Path Recording II   3020        355547    648                           26

Table 4.20: Profile of Soft Decision Decoding Viterbi Decoder (I/O Included).

The second best improvement comes from the counter splitting, because it enables the software pipelining to work much better on the key loop (the ACS loop) of our Viterbi decoder. Finally, we achieve a processing rate of 1101 Kbits/sec with I/O included or 1514 Kbits/sec with I/O excluded.

Optimization Step           Code Size   Cycles    Processing Rate (Kbits/sec)   Improvement (%)
Original                    3388        8336690   28                            N/A
Fixed Point                 2856        560080    411                           1367
Path Recording I            2828        392896    586                           43
Path Recording II           3020        298492    772                           32
Counter Splitting           3284        179976    1280                          66
Removal of Metric Replica   3896        152206    1514                          18

Table 4.21: Profile of Soft Decision Decoding Viterbi Decoder (I/O Excluded).

4.5.5 Simulation Profile for FEC Encoder

In this subsection, we show the improved profile of our FEC encoder, which concatenates the RS encoder and the convolutional encoder. After encoding around 100 bytes of data (108 bytes for schemes 1, 3, and 4; 104 bytes for scheme 2, to meet the code specification), Table 4.22 shows the average improved profile for the four required

Modulation   RS Code     CC Code Rate   Code Size   Cycles              Processing Rate (Kbits/sec)
                                                    W/        W/O       W/       W/O
QPSK         (24,18,3)   2/3            2384        86762     77938     5975     6651
QPSK         (30,26,2)   5/6            2464        56312     48758     8865     10238
16-QAM       (48,36,6)   2/3            2388        61677     54044     8405     9592
16-QAM       (60,54,3)   5/6            2476        59638     52854     8692     9808

Table 4.22: Profile of Forward Error Correction Encoder.

coding schemes defined in the IEEE 802.16a standard, where "W/" and "W/O" represent "with I/O" and "without I/O", respectively. After improvement, the average processing rate of the FEC encoder reaches 7984 Kbits/sec with I/O included or 9072 Kbits/sec with I/O excluded.

4.5.6 Simulation Profile for FEC Decoder

Table 4.23 shows the average improved profile obtained when decoding the coded data generated in the above simulation. After improvement, the average processing rate of the FEC decoder reaches 750 Kbits/sec with I/O included or 960 Kbits/sec with I/O excluded.

Modulation   RS Code     CC Code Rate   Code Size   Cycles              Processing Rate (Kbits/sec)
                                                    W/        W/O       W/       W/O
QPSK         (24,18,3)   2/3            12836       815093    650539    636      797
QPSK         (30,26,2)   5/6            12960       673042    535694    742      932
16-QAM       (48,36,6)   2/3            12652       696082    533861    745      971
16-QAM       (60,54,3)   5/6            12936       591806    456128    876      1137

Table 4.23: Profile of Forward Error Correction Decoder.

Chapter 5

ACS Unit Acceleration by Employing Xilinx FPGA as an Assistant

Based on the simulation results discussed in Chapter 4, we know the speed bottleneck of our FEC program is the soft decision decoding Viterbi decoder.

Furthermore, we know that the most time consuming kernel in the Viterbi decoder is the Add-Compare-Select (ACS) unit. In order to speed up the ACS unit, we have done some optimization on the DSP platform, but the final speed is still slower than we wish. The reason for the slow operating speed of the ACS unit is the massive sequential computation required to obtain a single output bit; i.e., it requires 64 ACS computations to obtain a single output bit. We notice that the ACS unit is suitable for FPGA implementation, since on an FPGA we can design and allocate as many functional units as we want, as long as we do not exceed the area limit of the FPGA. Clearly, we can accelerate the ACS unit on the Xilinx FPGA XC2V2000, which is embedded on the Quixote DSP baseboard, by simply placing 64 ACS units on the FPGA and making them operate in parallel, and then integrating them with the original DSP program to raise the overall speed of our FEC decoder. In this chapter, we test two ACS designs on the FPGA and evaluate how much improvement we may achieve with the assistance of the FPGA. Similar to the case of the DSP implementation, the Xilinx FPGA on the Quixote board must be controlled by the DSP program. However, the communication mechanism of the Quixote board does not work; thus, the simulations shown in the following sections are obtained from the Debussy nWave tool.

5.1 ACS Design - I

5.1.1 Original ACS Structure

The original ACS structure we designed is shown in Fig. 5.1, where SM1 and SM2 denote the upper and lower state metrics of the ACS butterfly structure shown in Chapter 2, and BM1 and BM2 denote the upper and lower branch metrics, respectively. CTL_IN and CTL_OUT denote the input and output control signals, SEL denotes the path record information, and N_SM denotes the next state metric after the ACS computation. This structure can operate at around 100 MHz, which can be translated to a processing rate of 12.5M (64 states/sec).

The unit "64 states/sec" represents how many sets of 64 state metrics can be computed per second, since the Viterbi algorithm has to compute 64 state metrics to produce 1 decoded output bit. The ACS module implemented on the FPGA is much faster than on the DSP; the DSP version only achieves 2M (64 states/sec). However, we find a physical limit after finishing the original design: the transmission bandwidth limit of our implementation platform. Based on the architecture of the Quixote DSP baseboard discussed in Chapter 3, we know that the communication between the DSP and the FPGA must go through EMIF (External Memory InterFace) A, whose bandwidth is 64 bits at 133 MHz, or 8512 Mbps. Although this bandwidth is wide enough for most applications, it is still not sufficient for the ACS module we designed originally. According to Fig. 5.2, which shows the synthesis report generated by Xilinx ISE 6.1 for our original design, the ACS module requires 690 bits of data transmission for the ACS computation of 16 state metrics, or equivalently 690*4 bits of data transmission for decoding 1 bit. In the units used for EMIF A, this translates to 8512/(690*4) = 3.08M (64 states/sec); that is, the bandwidth of EMIF A can only support our original ACS module up to a processing rate of 3.08M (64 states/sec). Thus, without modification, the processing rate of 12.5M (64 states/sec) is meaningless when we actually integrate the FPGA ACS module with the rest of the DSP program.
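The bandwidth ceiling can be reproduced numerically; the constants below come from the text, while the function name is ours:

```c
/* EMIF A moves 64 bits per cycle at 133 MHz (8512 Mbit/s), while the
   original ACS design needs 690*4 bits of transfer per decoded bit,
   i.e., per update of all 64 state metrics. */
double emif_limited_rate(void) {
    double emif_bits_per_sec = 64.0 * 133.0e6;   /* 8512 Mbit/s */
    double bits_per_update   = 690.0 * 4.0;      /* 2760 bits   */
    return emif_bits_per_sec / bits_per_update;  /* ~3.08M (64 states/sec) */
}
```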

Figure 5.2: FPGA Synthesis Report for Original ACS Design.

Device utilization summary:

---

Selected Device : 2v6000ff1152-6

Number of Slices: 1122 out of 33792 3%

Number of Slice Flip Flops: 1104 out of 67584 1%

Number of 4 input LUTs: 1312 out of 67584 1%

Number of bonded IOBs: 690 out of 824 83%

Obviously, the processing rate of our design is limited by the data transmission rate that EMIF A can support. There are two possible solutions to this problem. The first is to find a faster communication interface between the DSP and the FPGA, but that seems hard to do. So we consider another approach: reduce the number of pins used in our design.

The first improvement we made to reduce the number of used pins is to avoid inputting the state metric values to the FPGA, since a state metric value is usually large and requires more bits to represent. This can be achieved by storing the state metric values inside the FPGA: since the initial values of the state metrics are known to be zero, we only have to reset the state metrics to zero at the beginning of decoding and then keep the updated state metrics in registers inside the FPGA for the next-stage computation. Another possibility for reducing pins is to reduce the input data size before the data are sent to the FPGA device; that is, we can represent the branch metrics with fewer bits. After surveying several papers and textbooks, we find that quantizing the branch metrics to 8 levels is reasonable: according to the study in [20], it results in only a slight decrease in coding gain when the number of quantization levels is limited to 8. Together, the 8-level quantization and the elimination of state metric transmission can provide a processing rate of up to 32M (64 states/sec), which is much better than the original 3.08M (64 states/sec).
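A minimal sketch of the 8-level (3-bit) branch metric quantization, assuming simple saturation; the actual quantizer design is not given in the text:

```c
/* Clamp a branch metric into [0, 7] so it fits in 3 bits (8 levels)
   for transfer to the FPGA.  Assumed scheme: simple saturation. */
unsigned quantize8(int metric) {
    if (metric < 0) return 0u;
    if (metric > 7) return 7u;
    return (unsigned)metric;
}
```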

5.1.2 Improved ACS Structure

The modified ACS structure is shown in Fig. 5.3. The core module is the ACS64 module, which consists of 64 ACS units computing the 64 state metrics for the next stage in parallel every two cycles. IN_BUF and OUT_BUF denote the input buffer and output buffer, respectively, and they are used to link with the DSP device; i.e., these two buffers carry only the quantized branch metrics and the decoded outputs, not the state metrics, for the purpose of eliminating state metric input and output. This is because, if we do not send the state metrics back to the DSP side, we must implement the comparator on the FPGA to select the best terminating state each time the time stage reaches the truncation length of our Viterbi decoder. To make the comparator operate more efficiently, we manually partition it into five pipeline stages, controlled by the FSM module. That is to say, the comparator outputs the state with the smallest state metric value five cycles after it receives the END_REQ control signal transmitted by the DSP-side program.

Figure 5.3: Schematic of Modified ACS Design.
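Functionally, the pipelined comparator just selects the state with the smallest metric among the 64 survivors; ignoring the pipelining, its behavior can be modeled in C as:

```c
/* Behavioral model of the terminating-state comparator: return the
   index of the smallest of the 64 state metrics.  On the FPGA this
   reduction is split into five pipeline stages driven by the FSM. */
int best_terminating_state(const int metric[64]) {
    int best = 0;
    for (int s = 1; s < 64; s++)
        if (metric[s] < metric[best])
            best = s;
    return best;
}
```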

According to the synthesis report for our modified ACS design, shown in Fig. 5.4, the number of slices used in our new (quantized) ACS design is greater than in the original design, mainly due to the addition of the comparator. As expected, the

