
4 Implementation and Optimization of 802.16a FEC Scheme on DSP Platform

4.5 Simulation Results

4.5.6 Simulation Profile for FEC Decoder

Table 4.23 shows the improved profile obtained in decoding the coded data generated in the above simulation. We find that, after improvement, the average processing rate of the FEC decoder reaches 750 Kbits/sec with I/O included, or 960 Kbits/sec with I/O excluded.

Modulation   RS Code     CC Code Rate   Code Size   Cycles                Processing Rate (Kbits/sec)
                                                    W/ I/O     W/O I/O    W/ I/O     W/O I/O
QPSK         (24,18,3)   2/3            12836       815093     650539     636        797
QPSK         (30,26,2)   5/6            12960       673042     535694     742        932
16-QAM       (48,36,6)   2/3            12652       696082     533861     745        971
16-QAM       (60,54,3)   5/6            12936       591806     456128     876        1137

Table 4.23: Profile of Forward Error Correction Decoder.

Chapter 5

ACS Unit Acceleration by Employing Xilinx FPGA as an Assistant

Based on the simulation results discussed in Chapter 4, we know that the speed bottleneck of our FEC program is the soft-decision Viterbi decoder.

Furthermore, we know that the most time-consuming kernel in the Viterbi decoder is the Add-Compare-Select (ACS) unit. To speed up the ACS unit, we did some optimization on the DSP platform, but the final speed is still slower than we wish. The reason for the slow operation of the ACS unit is the massive sequential computation required to obtain a single output bit; i.e., 64 ACS computations are needed to produce one decoded bit. We notice that the ACS unit is well suited to FPGA implementation, where we can design and allocate as many functional units as we want as long as the area limit of the FPGA is not exceeded. Clearly, we can accelerate the ACS unit on the Xilinx XC2V2000 FPGA, which is embedded on the Quixote DSP baseboard, by simply placing 64 ACS units on the FPGA and making them operate in parallel, and then integrating them with the original DSP program to improve the overall speed of our FEC decoder. In this chapter, we test two ACS designs on the FPGA and evaluate how much improvement we may achieve with the assistance of the FPGA. Similar to the DSP implementation, the Xilinx FPGA on the Quixote board must be controlled by the DSP program. However, the communication mechanism of the Quixote board does not work. Thus, the simulation results shown in the following sections are obtained from the nWave waveform viewer of Debussy.

5.1 ACS Design - I

5.1.1 Original ACS Structure

The original ACS structure we designed is shown in Fig. 5.1, where SM1 and SM2 denote the upper and lower state metrics of the ACS butterfly structure shown in Chapter 2, BM1 and BM2 denote the upper and lower branch metrics, CTL_IN and CTL_OUT denote the input and output control signals, SEL denotes the path record information, and N_SM denotes the next state metric after the ACS computation. This structure can operate at around 100 MHz, which translates to a processing rate of 12.5M (64 states/sec).
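As a reference for the discussion above, the following minimal behavioral C sketch models one ACS unit; the function and variable names (acs_unit, sm1, bm1, sel) are ours for illustration and are not the actual port names of the FPGA design.

```c
#include <stdint.h>

/* Behavioral sketch of one ACS unit (illustrative names only).
 * sm1, sm2: state metrics of the two competing predecessor states
 * bm1, bm2: branch metrics of the corresponding branches
 * *sel    : survivor (path record) bit written for traceback
 * returns : next state metric N_SM                                */
static uint32_t acs_unit(uint32_t sm1, uint32_t bm1,
                         uint32_t sm2, uint32_t bm2,
                         uint8_t *sel)
{
    uint32_t cand1 = sm1 + bm1;   /* Add */
    uint32_t cand2 = sm2 + bm2;   /* Add */
    if (cand1 <= cand2) {         /* Compare: smaller metric survives */
        *sel = 0;                 /* Select the upper path */
        return cand1;
    }
    *sel = 1;                     /* Select the lower path */
    return cand2;
}
```

On the FPGA, 64 such units run concurrently, whereas the DSP has to execute them one after another, which is why the ACS kernel dominates the DSP run time.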

The unit “64 states/sec” represents how many complete sets of 64 state metrics can be computed per second, since the Viterbi algorithm has to compute 64 state metrics to produce one decoded output bit. The ACS module implemented on the FPGA is much faster than on the DSP, where our version achieves only 2M (64 states/sec). However, after finishing the original design we found a physical limit: the transmission bandwidth of our implementation platform. Based on the architecture of the Quixote DSP baseboard discussed in Chapter 3, the communication between the DSP and the FPGA must go through EMIF (External Memory InterFace) A, whose bandwidth is 64 bits at 133 MHz, or 8512 Mbits/sec. Although this bandwidth is wide enough for most applications, it is not sufficient for the ACS module we designed originally. According to Fig. 5.2, which shows the synthesis report generated by Xilinx ISE 6.1 for our original design, the ACS module requires 690 bits of data transmission for the ACS computation of 16 state metrics; equivalently, 690*4 bits of data transmission for decoding one bit. In terms of the EMIF A bandwidth, this translates to 8512/(690*4) = 3.08M (64 states/sec); that is, EMIF A can only feed our original ACS module at a processing rate of up to 3.08M (64 states/sec).

Thus, without modification, the processing rate of 12.5M (64 states/sec) is meaningless once we actually integrate the FPGA ACS module with the rest of the DSP program.
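As a quick sanity check, the bound above can be reproduced with a few lines of C; the EMIF figure (64 bits at 133 MHz) and the 690-bit I/O count are taken from the text, and everything else is plain arithmetic.

```c
#include <stdio.h>

int main(void)
{
    const double emif_bw_bps    = 64.0 * 133.0e6;  /* EMIF A: 64 bits @ 133 MHz = 8512 Mbit/s   */
    const double bits_per_16sm  = 690.0;           /* I/O bits per 16-state-metric computation  */
    const double bits_per_stage = bits_per_16sm * 4.0;  /* 64 states = 4 x 16 -> per decoded bit */

    /* Maximum number of 64-state stages (i.e., decoded bits) per second the EMIF can sustain */
    double max_stages_per_sec = emif_bw_bps / bits_per_stage;
    printf("EMIF-limited rate: %.2f M (64 states/sec)\n", max_stages_per_sec / 1e6);
    return 0;   /* prints about 3.08 */
}
```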

Figure 5.2: FPGA Synthesis Report for Original ACS Design.

Device utilization summary:

---

Selected Device : 2v6000ff1152-6

Number of Slices: 1122 out of 33792 3%

Number of Slice Flip Flops: 1104 out of 67584 1%

Number of 4 input LUTs: 1312 out of 67584 1%

Number of bonded IOBs: 690 out of 824 83%

Obviously, the processing rate of our design is limited by the data transmission rate that EMIF A can support. There are two possible solutions to this problem. The first is to find a faster communication interface between the DSP and the FPGA, but this seems hard to do. So we consider another approach: reducing the number of pins used in our design.

The first improvement we made to reduce the number of pins used is to avoid inputting the state metric values to the FPGA, since a state metric value is usually large and requires more bits to represent. This can be achieved by storing the state metric values inside the FPGA. Since the initial values of the state metrics are known to be zero, we only have to reset the state metrics to zero at the beginning of decoding and then keep the updated state metrics in registers inside the FPGA for the next-stage computation. Another way to reduce the pins is to shrink the input data before they are sent to the FPGA device; that is, we can represent each branch metric with fewer bits. After surveying several papers and textbooks, we find that quantizing the branch metric to 8 levels is reasonable: according to [20], increasing the number of quantization levels beyond 8 yields only a negligible improvement in coding gain. Together, the 8-level quantization and the elimination of state metric transmission raise the bandwidth-limited processing rate to 32M (64 states/sec), which is much better than the original 3.08M (64 states/sec).
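The following sketch illustrates the kind of 8-level (3-bit) branch metric quantization we have in mind before the metrics enter the FPGA input buffer; the function name and scaling are ours for illustration and are not taken from the actual DSP code.

```c
#include <stdint.h>

#define BM_LEVELS 8   /* 8 quantization levels -> 3 bits per branch metric */

/* Map a soft branch metric in [0, bm_max] onto the 3-bit range 0..7.
 * The real DSP code may use integer metrics and a different scaling;
 * this only illustrates the 8-level quantization idea.               */
static uint8_t quantize_bm(float bm, float bm_max)
{
    float scaled = bm * (float)(BM_LEVELS - 1) / bm_max;
    if (scaled <= 0.0f)
        return 0;
    if (scaled >= (float)(BM_LEVELS - 1))
        return BM_LEVELS - 1;
    return (uint8_t)(scaled + 0.5f);   /* round to the nearest level */
}
```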

5.1.2 Improved ACS Structure

The modified ACS structure is shown in Fig. 5.3. The core module is the ACS64 module, which consists of 64 ACS units that compute the 64 state metrics for the next stage in parallel, every two cycles. IN_BUF and OUT_BUF denote the input and output buffers, respectively, and these two buffers are the interface to the DSP device. The 64 state metrics themselves are kept in registers inside the FPGA and updated locally, for the purpose of eliminating state metric input and output. A comparator is therefore also required: since we do not send the state metrics back to the DSP side, we must implement the comparator on the FPGA to select the best terminating state each time the time stage equals the truncation length of our Viterbi decoder. To make the comparator operate more efficiently, we manually partition it into five pipeline stages, controlled by the FSM module. That is to say, the comparator outputs the state with the smallest state metric value 5 cycles after it receives the END_REQ control signal transmitted by the DSP-side program.
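Behaviorally, the comparator simply returns the index of the smallest of the 64 final state metrics, as the following C sketch (ours, for illustration) shows; on the FPGA this comparison tree is the five-stage pipeline described above.

```c
#include <stdint.h>

#define NUM_STATES 64

/* Return the index of the state with the smallest state metric.
 * On the FPGA the same reduction is a comparison tree manually split
 * into five pipeline stages; here it is a plain sequential loop.      */
static uint8_t best_end_state(const uint32_t sm[NUM_STATES])
{
    uint8_t  best = 0;
    uint32_t min  = sm[0];
    for (uint8_t s = 1; s < NUM_STATES; s++) {
        if (sm[s] < min) {
            min  = sm[s];
            best = s;
        }
    }
    return best;   /* starting state for the traceback on the DSP side */
}
```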

Figure 5.3: Schematic of Modified ACS Design.

According to the synthesis report for our modified ACS design, shown in Fig. 5.4, the number of slices used in the new (quantized) ACS design is greater than in the original design, mainly due to the added comparator. As expected, the number of IOBs used in the new design is greatly reduced because of the quantization of the branch metrics and the avoidance of outputting intermediate state metric values; thus the processing rate constraint imposed by the transmission bandwidth is greatly relaxed. From the synthesis report, the estimated clock rate is around 161 MHz. From the P&R report, shown in Fig. 5.5, the post-P&R clock rate of our design is 1/11.089 ns = 90 MHz; since a valid output is produced every 2 cycles, this translates to a processing rate of 45M (64 states/sec).

The waveform of the modified ACS design is shown in Fig. 5.6. The control signal IN_VALID is transmitted by the DSP to inform the FPGA that the data in the input buffer is valid, and the control signal OUT_VALID is transmitted by the FPGA to inform the DSP that the data in the output buffer is valid. The control signal END_REQ is sent by the DSP, according to the truncation length of our Viterbi decoder, to inform the FPGA that the DSP now requires the index of the best state to perform the traceback operation. As described above, the result of the comparison becomes valid 5 cycles later.

Device utilization summary:

---

Selected Device : 2v2000ff896-6

Number of Slices: 3785 out of 10752 35%

Number of Slice Flip Flops: 3049 out of 21504 14%

Number of 4 input LUTs: 6131 out of 21504 28%

Number of bonded IOBs: 266 out of 624 42%

Timing Summary:

--- Speed Grade: -6

Minimum period: 6.175ns (Maximum Frequency: 161.956MHz)
Minimum input arrival time before clock: 4.349ns
Maximum output required time after clock: 4.812ns
Maximum combinational path delay: No path found

The FPGA then sends a control signal DONE to inform the DSP that the result of the comparison is now valid. One thing to notice is that the time interval between two consecutive IN_VALID signals cannot be less than 2 clock cycles, since the ACS unit needs 2 cycles to complete the computation and raise the OUT_VALID signal.

Figure 5.5: Place and Route Report for Modified ACS Design.

The NUMBER OF SIGNALS NOT COMPLETELY ROUTED for this design is: 0
The AVERAGE CONNECTION DELAY for this design is: 1.082
The MAXIMUM PIN DELAY IS: 11.089
The AVERAGE CONNECTION DELAY on the 10 WORST NETS is: 5.369

Figure 5.6: Waveform of Modified ACS Design.
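Because the board's communication mechanism does not work, the DSP-side driver below is only a hypothetical sketch of the handshaking just described; the register names and bit assignments are made up and do not correspond to any actual Quixote API or memory map.

```c
/* Hypothetical DSP-side handshake for the modified ACS module.
 * All register names and the memory-mapped layout below are
 * placeholders, not the real Quixote interface.               */

#include <stdint.h>

#define IN_VALID   (1u << 0)  /* DSP -> FPGA: input buffer holds valid branch metrics */
#define OUT_VALID  (1u << 1)  /* FPGA -> DSP: output buffer (SEL bits) is valid       */
#define END_REQ    (1u << 2)  /* DSP -> FPGA: request the best terminating state      */
#define DONE       (1u << 3)  /* FPGA -> DSP: comparison result is valid              */

extern volatile uint32_t FPGA_CTRL;     /* placeholder control/status register */
extern volatile uint32_t FPGA_IN_BUF;   /* placeholder input buffer            */
extern volatile uint64_t FPGA_OUT_BUF;  /* placeholder output buffer           */

/* Feed one stage of quantized branch metrics and collect the 64 SEL bits.
 * Waiting for OUT_VALID guarantees at least 2 cycles between IN_VALID pulses. */
static void acs_stage(uint32_t quantized_bms, uint64_t *sel_bits)
{
    FPGA_IN_BUF = quantized_bms;
    FPGA_CTRL  |= IN_VALID;
    while (!(FPGA_CTRL & OUT_VALID))
        ;                                /* ACS64 needs 2 cycles per stage */
    FPGA_CTRL  &= ~IN_VALID;
    *sel_bits   = FPGA_OUT_BUF;          /* survivor bits for traceback    */
}

/* When the time stage reaches the truncation length, ask the FPGA for the
 * index of the best terminating state (valid 5 cycles after END_REQ).     */
static unsigned best_state_request(void)
{
    FPGA_CTRL |= END_REQ;
    while (!(FPGA_CTRL & DONE))
        ;
    FPGA_CTRL &= ~END_REQ;
    return (unsigned)(FPGA_OUT_BUF & 0x3F);   /* 6-bit state index */
}
```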

5.2 ACS Design - II

Although the processing rate of the previous design, 45M (64 states/sec), already exceeds the physical limit of 32M (64 states/sec) imposed by the transmission bandwidth, we try to further improve the ACS design for research and evaluation purposes.

In [21], a new ACS architecture is proposed that accelerates the ACS unit by increasing the number of trellis states. By reformulating the Viterbi algorithm, the proposed architecture provides an alternative approach to high-throughput design.


The newly proposed architecture is called the “double state” architecture, referring to the fact that it requires twice as many states as the original architecture. With the doubled states, the serial operations of the ACS unit can be transformed into parallel operations.

The basic concept of this new architecture is briefly introduced here. For simplicity, we take the convolutional code with generator polynomial X+1 as an example. By extending the original generator polynomial of the convolutional code by one degree and setting the coefficient of the highest-degree term to zero, one can obtain an equivalent trellis as illustrated in Fig. 5.7, where X+1 is the original generator polynomial and 0·X^2 + X + 1 is the extended generator polynomial.

Figure 5.7: Two Equivalent Trellises.

In the double state trellis, the branch metrics (BMs) ending at the same state are all equal. That is to say, when computing the state metrics for the next stage, we can select the minimum of the two previous-stage state metrics without waiting for the addition of the BMs. Fig. 5.8 (a) shows the original ACS architecture while Fig. 5.8 (b) shows the modified “double state” ACS architecture, where n denotes the current stage and n+1 the next stage. Because the two BMs entering state k are equal (BM^n_{i,k} = BM^n_{j,k}), we do not have to wait for the addition of the BMs and can compare the two current state metrics directly. Equivalently, we perform the following recursion:

    SM^{n+1}_k = min(SM^n_i, SM^n_j) + BM^n_{i,k}.

Figure 5.8: (a) Original ACS Architecture. (b) ACS Architecture Based on Double State Trellis.

Therefore, we can make the addition of the BM and the comparison of the SMs operate in parallel, while the original ACS architecture must operate sequentially. Moreover, another hardware saving is also proposed based on the double state architecture. Looking at the next states 10 and 11 in Fig. 5.7, they share the same pair of current states, 01 and 11. Hence, if the next state 10 chooses the path from the current state 01 over the one from the current state 11, then the same decision is made at the “Select” operation for the next state 11. Therefore, every two states in the “double state” architecture can share the same decision-making unit in the “Compare” operation.
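To make the reformulation concrete, the following C fragment (ours, for illustration) contrasts the original recursion with the double state one; i and j denote the two predecessors of next state k, and the double state version relies on the fact that both branches into k carry the same branch metric.

```c
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b) { return (a < b) ? a : b; }

/* Original ACS: both additions must finish before the comparison starts. */
static uint32_t acs_original(uint32_t sm_i, uint32_t bm_ik,
                             uint32_t sm_j, uint32_t bm_jk)
{
    return min_u32(sm_i + bm_ik, sm_j + bm_jk);
}

/* Double state trellis: BM_{i,k} = BM_{j,k}, so the comparison of the two
 * state metrics and the single addition can proceed in parallel:
 *     SM_k^{n+1} = min(SM_i^n, SM_j^n) + BM_{i,k}^n                       */
static uint32_t acs_double_state(uint32_t sm_i, uint32_t sm_j, uint32_t bm_k)
{
    return min_u32(sm_i, sm_j) + bm_k;
}
```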

The schematic of the new ACS design based on the double state trellis architecture is almost the same as before; the only difference is that the ACS64 unit is extended to an ACS128 unit to handle the doubled number of states. The synthesis report for this new ACS design is shown in Fig. 5.9.

Device utilization summary:

---

Selected Device : 2v2000ff896-6

Number of Slices: 6782 out of 10752 63%

Number of Slice Flip Flops: 5594 out of 21504 26%

Number of 4 input LUTs: 6131 out of 21504 48%

Number of bonded IOBs: 266 out of 624 42%

Timing Summary:

--- Speed Grade: -6

Minimum period: 4.347ns (Maximum Frequency: 230.053MHz)

From the synthesis report, the estimated clock rate of our new double state design is about 230 MHz (1.43 times faster than the previous design), but the area increases by a factor of about 1.8. Although area is not the major consideration in our case, in an actual ASIC design this may not be a good trade-off between area and speed. From the P&R report shown in Fig. 5.10, the actual clock rate on the FPGA decreases considerably because of routing: the post-P&R clock rate is 1/10.214 ns = 98 MHz, which translates to a processing rate of 49M (64 states/sec).

Figure 5.10: Place and Route Report for Double State ACS Design.

The NUMBER OF SIGNALS NOT COMPLETELY ROUTED for this design is: 0
The AVERAGE CONNECTION DELAY for this design is: 1.185

Figure 5.11: Macro Statistics of the (a) Original ACS Design, (b) Double State ACS Design.

Fig. 5.11 (a) shows the macro statistics of the original ACS design and (b) those of the double state ACS design, both generated by Xilinx ISE 6.1. Compared with the original design, the double state design requires twice as many adders, SM registers (the 13-bit registers), and multiplexers, plus 64 more comparators to compare the 128 resulting SMs for traceback.

Figure 5.12 shows the waveform of the double state ACS design. It is similar to that of the original ACS design. The only difference is that the end state of the double state design may be the same as in the original design, or it may be the state that differs only in the MSB from the end state of the original design.

Figure 5.12: Waveform of Double State ACS Design.
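If we assume the 128 double states are indexed with the extra (zero-coefficient) register bit in the MSB position, the 64-state index used by the DSP traceback can be recovered by simply dropping that bit, as in the one-line sketch below; this indexing is our assumption and is not stated in the reports.

```c
/* Assumed indexing: double-state index = {extra MSB, original 6-bit state}. */
static inline unsigned original_state(unsigned double_state)
{
    return double_state & 0x3F;   /* drop the MSB, keep the 64-state index */
}
```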

Chapter 6

Conclusion and Future Work

6.1 Conclusion

In this thesis, our original target is to implement the FEC scheme of the IEEE 802.16a wireless communication standard on Innovative Integration's Quixote DSP baseboard, which houses the Texas Instruments TMS320C6416 DSP chip, and to optimize the FEC scheme to achieve real-time operation. However, due to the malfunction of the Quixote board, we are not able to actually transmit data between the DSP board and the host PC. We thus can only simulate the modified FEC schemes on the TI DSP simulator included in the TI Code Composer Studio. Therefore, the simulation profiles shown in this thesis may not match the exact profiles obtained from the DSP emulator operating on the Quixote board. However, we believe that the algorithm/program modifications we have made based on the DSP simulator are also valid on the actual Quixote board, and the improvement ratios should be similar.

In the previous chapters, we first introduce the FEC standard of IEEE 802.16a briefly. We then describe how we implement it and explain the algorithms we use. Next, we introduce our implementation environment to show the available hardware resources.

Afterward, we apply some useful techniques to speed up the software: we either tune the program so that the compiler works more efficiently, or modify the programs based on algorithms proposed by ourselves or by other researchers, to improve the processing rate of the RS encoder and decoder and of the convolutional (Viterbi) decoder. With DSP optimization, we have achieved an average processing rate of 7984 Kbits/sec for the FEC encoder with I/O included, or 9072 Kbits/sec without I/O, and an average processing rate of 750 Kbits/sec for the FEC decoder with I/O included, or 960 Kbits/sec without I/O. (Once again, the actual processing rates on the DSP emulator operating on the Quixote board may differ from these results.)

To further improve the processing rate of the FEC decoder, we need to accelerate the ACS operation in the Viterbi decoder, which is the bottleneck of our implementation.

We put the ACS operation on the Xilinx FPGA, which sits on the Quixote board as an assistant hardware resource. We have run two simulations to evaluate the processing rate of the ACS operation implemented on the Xilinx FPGA. The first ACS design, which simply makes the ACS units operate in parallel, achieves a processing rate of 45M (64 states/sec), much faster than the original speed of 2M (64 states/sec) on the DSP. The second ACS design, based on the double state trellis architecture, achieves a processing rate of 49M (64 states/sec), but it consumes about 1.8 times the area of the first ACS design. Considering the limit on the physical transmission bandwidth of the Quixote board, an ACS processing rate beyond 32M (64 states/sec) does not result in an actual improvement in data processing speed. Therefore, if we implement the ACS units on the Xilinx FPGA, the first ACS design is acceptable and achieves a good balance of area and speed.

6.2 Future Work

As mentioned above, the communication mechanism of the Quixote DSP baseboard currently does not work. Once this problem is solved, the streaming mechanism should be integrated with our FEC program to actually test the exact processing rate on the Quixote board.

As for the optimization tasks, there are still some functions that can be further improved to accelerate the overall processing rate of our FEC program. For the RS decoder, the Chien search still consumes a lot of execution time if the data errors happen to occur in the last symbols of the codeword. If we can find a fast Chien search algorithm whose computational complexity is independent of the positions of the errors, the processing rate of the RS decoder can be much improved. For the Viterbi decoder, the ACS operation consumes a lot of execution time due to the need to expand the entire trellis structure, even though only a small portion of the trellis is actually used to generate the decoding path. So, if we can find a modified Viterbi algorithm that does not expand the entire trellis but only the part needed to generate the decoding path, we can accelerate the whole FEC decoder by a large factor, because the ACS operation is the bottleneck of our program.
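As a reminder of why the run time depends on the error positions, the sketch below shows a Chien search that stops as soon as all deg(Λ) roots are found; gf_mul(), the power table, and the codeword length are placeholders for the GF(2^8) routines of the actual RS decoder, not its real code.

```c
#include <stdint.h>

#define N_CODE 255                              /* placeholder RS codeword length   */
extern uint8_t gf_mul(uint8_t a, uint8_t b);    /* assumed GF(2^8) multiply routine */
extern const uint8_t gf_alpha_pow[256];         /* assumed table of powers of alpha */

/* Chien search with early exit: evaluate the error locator polynomial lambda
 * (degree t_err) at alpha^{-i} for each position i and record the roots.
 * If the errors sit near the end of the codeword, the loop runs almost to
 * N_CODE before all t_err roots are found, which is the case noted above.   */
static int chien_search(const uint8_t *lambda, int t_err, int *err_pos)
{
    int num_found = 0;
    for (int i = 0; i < N_CODE && num_found < t_err; i++) {
        uint8_t x   = gf_alpha_pow[(N_CODE - i) % 255];  /* alpha^{-i}            */
        uint8_t val = 0;
        for (int d = t_err; d >= 0; d--)                 /* Horner evaluation     */
            val = gf_mul(val, x) ^ lambda[d];
        if (val == 0)
            err_pos[num_found++] = i;                    /* error location index  */
    }
    return num_found;
}
```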

Bibliography

[1] IEEE Standard for local and metropolitan area networks, Part 16, Amendment 2.

[2] D. Wilson, “An Efficient Viterbi Decoder Implementation for the ZSP500 DSP Core,” Adv. DSP Dev., LSI Logic.

[3] I. S. Reed and X.-M. Chen, Error-Control Coding for Data Networks. Kluwer Academic Publishers, Dordrecht, 1999.

[4] C. Paar, “A new architecture for a parallel finite field multiplier with low complexity based on composite fields,” IEEE Trans. on Comp., vol. 45, pp. 856-861, Jul. 1996.

[5] C. Paar and G. Orlando, “A Super-Serial Galois Fields Multiplier for FPGAs and its Application to Public-Key Algorithms,” Seventh Annual IEEE Symp. on Field-Programmable Custom Computing Machines, FCCM ‘99, pp. 232-239, Apr. 1999.

[6] E. Savas and Ç. K. Koç, “Efficient Methods for Composite Field Arithmetic,” Elect. and Comp. Eng., Oregon State University, Dec. 1999.

