CHAPTER 3 ALGORITHM AND ARCHITECTURE FOR MULTI-MODE FEC
3.5 The Memory Consideration for Test Chip
As mentioned above, universal convolutional interleaving for ITU-T J.83B requires about 64K bytes of memory, and the multi-mode RS decoder requires 752 bytes. Although 64K bytes is not a large memory, it cannot be embedded in the test chip because the chip area available for academic research is limited.
Hence, the 64K bytes are provided as external memory, and only the 752 bytes of SRAM for the RS decoder are embedded in the test chip. The system platform is modified accordingly, as shown in figure 3.20.
[Figure 3.20 block diagram. Data from the de-mapper enter the test chip, which contains the Trellis Decoder & Synchronization (B), Descrambler (B), Deinterleaver (A/B/C/D), RS Decoder (A/B/C/D) with 752 bytes of embedded memory, Descrambler (A/C/D), and multiplexers (MUX) controlled by the mode signal. The 64K bytes of external memory lie outside the test chip.]
Figure 3.20: The system platform with memory consideration
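The origin of the roughly 64K bytes can be illustrated with a textbook delay-line model of convolutional interleaving. The sketch below is not the memory-based architecture proposed in this chapter; the class name, the FIFO-per-branch structure, and the setting (I, J) = (128, 8) are assumptions made for illustration. Branch i of the interleaver delays its symbols by i*J visits, the deinterleaver uses the complementary delays, and the storage on each side is J*I*(I-1)/2 symbols.

```python
from collections import deque


class ConvolutionalInterleaver:
    """Textbook delay-line model (illustrative, not the thesis architecture)."""

    def __init__(self, I, J, deinterleave=False, fill=0):
        self.I = I
        # Interleaver: branch i delays by i*J visits; deinterleaver: (I-1-i)*J.
        delays = [(I - 1 - i) * J if deinterleave else i * J for i in range(I)]
        self.lines = [deque([fill] * d) for d in delays]
        self.branch = 0

    def push(self, symbol):
        """Feed one symbol into the current branch and return that branch's output."""
        line = self.lines[self.branch]
        self.branch = (self.branch + 1) % self.I
        if not line:  # zero-delay branch passes the symbol straight through
            return symbol
        line.append(symbol)
        return line.popleft()

    def memory_symbols(self):
        """Total number of symbols held in all delay lines."""
        return sum(len(line) for line in self.lines)


def round_trip(data, I, J):
    """Interleave then deinterleave; the end-to-end delay is I*(I-1)*J symbols."""
    tx = ConvolutionalInterleaver(I, J)
    rx = ConvolutionalInterleaver(I, J, deinterleave=True)
    delay = I * (I - 1) * J
    padded = list(data) + [0] * delay
    out = [rx.push(tx.push(s)) for s in padded]
    return out[delay:]
```

Under these assumptions, `ConvolutionalInterleaver(128, 8).memory_symbols()` evaluates to 65024, that is 8*128*127/2, which is consistent with "about 64K bytes" at one byte per symbol.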
3.6 Summary
In this chapter, a multi-mode RS decoder, which uses memories to store and correct the received data, and a memory-based universal convolutional interleaver and deinterleaver are proposed.
Both have low overhead and high flexibility for multi-mode design, and both are compatible with standards such as J.83, DVB-T, and ATSC Digital TV.
The proposed multi-mode RS decoder supports error correction capabilities of t = 3, 8, and 10 over GF(2^7) and GF(2^8). The Berlekamp-Massey (BM) algorithm, rather than the Euclidean algorithm, is adopted to solve the key equation because of its regularity. The decoder can also be easily modified to meet the requirements of different applications. In addition, the proposed universal convolutional interleaver and deinterleaver support all parameter settings of convolutional interleaving, and this parameterized design shortens time-to-market. We also describe the implementation of the single-mode scrambler and the Viterbi decoder. The Viterbi decoder uses the register-exchange method for survivor path storage management: the convolutional code in J.83B has only 16 states, so the number of registers required is small. Finally, because the test chip area available for academic research is limited, the 64K bytes of memory for universal convolutional interleaving are provided as external memory.
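The register-exchange idea mentioned above can be sketched in software. This is a minimal hard-decision model, not the thesis hardware: the generator polynomials (octal 25 and 37), the unbounded register length, and all function names are assumptions chosen for illustration, and puncturing is omitted. Each of the 16 states owns a register holding the decoded bits of its survivor path; at every trellis step the winning predecessor's register is copied and extended by one bit, so no traceback is needed.

```python
K = 5                     # constraint length, giving 2**(K-1) = 16 states
G = (0o25, 0o37)          # assumed rate-1/2 generators, for illustration only
NSTATES = 1 << (K - 1)


def parity(x):
    """Parity (XOR) of the bits of x."""
    return bin(x).count("1") & 1


def conv_encode(bits):
    """Rate-1/2 feedforward encoder; returns one (c0, c1) pair per input bit."""
    state = 0
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state          # newest bit sits in the MSB
        out.append(tuple(parity(reg & g) for g in G))
        state = reg >> 1                      # drop the oldest bit
    return out


def viterbi_register_exchange(symbols):
    """Hard-decision Viterbi decoding with one survivor register per state."""
    inf = float("inf")
    metric = [0] + [inf] * (NSTATES - 1)      # the encoder starts in state 0
    regs = [[] for _ in range(NSTATES)]
    for rx in symbols:
        new_metric = [inf] * NSTATES
        new_regs = [None] * NSTATES
        for state in range(NSTATES):
            if metric[state] == inf:
                continue                      # state not reachable yet
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                nxt = reg >> 1
                expected = tuple(parity(reg & g) for g in G)
                m = metric[state] + sum(e != r for e, r in zip(expected, rx))
                if m < new_metric[nxt]:       # add-compare-select
                    new_metric[nxt] = m
                    new_regs[nxt] = regs[state] + [b]   # register exchange
        metric, regs = new_metric, new_regs
    best = min(range(NSTATES), key=metric.__getitem__)
    return regs[best]
```

In hardware the registers would have a fixed length (the truncation depth) and all 16 would be updated in parallel, which is why a small state count keeps the register cost modest.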
Chapter 4
Simulation and Implementation Result
This chapter presents the simulation platform and the results of the chip implementation. The proposed architecture and chip are also compared with other reference works. The comparison shows that the proposed architecture and chip offer low overhead, low power, and high flexibility for multi-mode FEC decoder design.
4.1 Platform and System design
The design flow is illustrated in figure 4.1. Each block is defined as follows:
(1) System platform simulation:
First, the system platform is simulated in a high-level language to verify the proposed algorithm. High-level simulation is important because it guarantees the functionality of the whole system before the hardware is designed. Matlab is chosen as the simulation environment because it is simple to use and powerful.
In addition to the multi-mode FEC decoder for J.83 shown in figure 3.1, the Matlab platform includes the multi-mode FEC encoder for J.83 shown in figure 4.2. The relationship between the encoder and the decoder is depicted in figure 4.3. After the test pattern is encoded, noise is added to the encoded data; the noise must stay within the error correction capability. The noisy encoded pattern is also recorded to a file for the later RTL simulation. The noisy encoded data are then fed to the multi-mode FEC decoder, and the decoder output is compared with the original uncoded pattern. After the proposed algorithm has been verified, we design the architecture and write the RTL code for the RTL behavior simulation.
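[Figure 4.1 flowchart. System platform simulation → Architecture and RTL simulation → Synthesis and gate level simulation → Auto Place & Route → Postlayout simulation → To fab → Chip verification. Each stage proceeds to the next on "OK"; a failed timing constraint after synthesis or after postlayout simulation sends the flow back to an earlier stage. The test pattern is an input to the flow.]
Figure 4.1: The design flow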
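[Figure 4.2 block diagram. Under control of the mode input, the data pass through Scrambler (A/C/D), RS Encoder (A/B/C/D), Interleaver (A/B/C/D), Scrambler (B), and Trellis Encoder (B), with multiplexers (MUX) selecting the path; the output goes to the QAM mapper.]
Figure 4.2: FEC encoder in J.83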
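[Figure 4.3 block diagram. The test pattern is encoded by the FEC encoder; noise is added to the encoded pattern, which is dumped to a file and fed to the FEC decoder; the decoded pattern is compared with the test pattern to check whether they are equal.]
Figure 4.3: Simulation environment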
(2) Architecture and RTL simulation:
The RTL code describes the system at the hardware level. Before the RTL code is written, the architecture and circuits should be defined according to the timing constraints. The architecture is described in chapter 3. Verilog is chosen as the hardware description language (HDL). The test bench stored from the Matlab platform is used to check whether the RTL code functions correctly. Note that RTL simulation treats the logic circuits as ideal, so gate level simulation is required after synthesis.
(3) Synthesis and gate level simulation:
After the RTL behavior simulation passes, we synthesize the design and run the gate level simulation. Synthesis and gate level simulation give nearly realistic logic gate delays and chip area, so the chip performance and area complexity can be estimated. If the timing or the chip area cannot meet the specification, we go back and redesign the architecture. Synopsys® Design Analyzer is our synthesis CAD tool, and the standard cell library is UMC® 0.18 µm 1P6M CMOS technology.
(4) Auto place and route (APR):
After the gate level simulation succeeds, the logic gates are placed and routed into a layout. Because deep-submicron processes introduce additional problems such as signal integrity, IR drop, and wire delay, we use the CAD tool “Cadence® SOC Encounter” to handle them. After APR, we use the “Calibre® DRC/LVS” CAD tools to check for DRC (design rule check) and LVS (layout versus schematic) errors. Then the post-layout gate level simulation is run to simulate the prototype of the chip. If the timing cannot meet the specification, we go back to (2) and redesign the architecture.
(5) Postlayout simulation:
The “Calibre® LPE (layout parameter extraction)” CAD tool extracts parameters such as transistors, capacitances, and resistances from the layout. After extraction, we use the “Nanosim” CAD tool for postlayout simulation. Nanosim sits between SPICE and Verilog: it is a transistor-level timing simulator and power dissipation analysis tool for digital circuit design, so it handles current and voltage simulation as well as timing checks.
After the postlayout simulation is verified, the chip is taped out to the fab.
(6) Chip verification:
The chip is verified with the IMS100 at CIC®. The test pattern is generated from the system platform and from the gate level or postlayout simulation. In addition to functional verification, the power consumption of the chip is measured at the same time.
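The verification loop of step (1) and figure 4.3 can be sketched as follows. The thesis platform is written in Matlab and uses the J.83 FEC encoder and decoder; this sketch is in Python and substitutes a 3x repetition code as a stand-in codec, so every function name and the noise model are assumptions for illustration only.

```python
import random


def encode(bits):
    """Stand-in for the FEC encoder: repeat every bit three times."""
    return [b for b in bits for _ in range(3)]


def decode(coded):
    """Stand-in for the FEC decoder: majority vote over each 3-bit group."""
    return [int(sum(coded[i:i + 3]) >= 2) for i in range(0, len(coded), 3)]


def add_noise(coded, rng):
    """Flip at most one bit per 3-bit codeword, staying within t = 1."""
    noisy = list(coded)
    for start in range(0, len(noisy), 3):
        if rng.random() < 0.5:
            noisy[start + rng.randrange(3)] ^= 1
    return noisy


def run_platform(num_bits=64, seed=1, dump_path=None):
    """Encode, add correctable noise, optionally dump, decode, and compare."""
    rng = random.Random(seed)
    pattern = [rng.randint(0, 1) for _ in range(num_bits)]
    noisy = add_noise(encode(pattern), rng)
    if dump_path is not None:
        # The dumped noisy pattern would later drive the RTL test bench.
        with open(dump_path, "w") as f:
            f.write("\n".join(str(b) for b in noisy))
    return decode(noisy) == pattern
```

Because the injected noise never exceeds the correction capability of the stand-in code, `run_platform()` returns `True`; a `False` result would point to a functional error in the codec rather than to excessive noise.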
4.2 Chip integration and the results of chip implementation
As mentioned in chapter 3, the universal convolutional deinterleaver requires 65032 bytes of memory, and the chip area is limited because the chip is built for academic research. Hence, using external memory for the deinterleaver is a good solution, and the environment for gate level simulation and postlayout simulation becomes the one shown in figure 4.4. The 65032-byte memory is described by a behavior model and simulated together with the chip; only the 752 bytes of memory for the RS decoder are embedded in the test chip. For convenience, chip verification likewise feeds the chip with data from the simulated external memory instead of a real external memory.
[Figure 4.4 block diagram. The chip receives the test pattern and produces the output; it exchanges control and data signals with the 65032-byte memory.]
Figure 4.4: The chip connected with external memory
Table 2: Summary of chip implementation for J.83 FEC
Technology                UMC® 0.18 µm 1P6M CMOS process
Chip size                 1.89 mm x 1.89 mm
Core size                 1.28 mm x 1.28 mm
Gate count                54.5K
Embedded SRAM             752 bytes
Supply voltage            1.8 V
Max operating frequency   83 MHz (600 Mbps)
Average power             @83 MHz     @7 MHz
  J.83 Annex A&C          25.2 mW     3.6 mW
  J.83 Annex B, 64QAM     43.2 mW     5.4 mW
  J.83 Annex B, 256QAM    45 mW       5.4 mW
  J.83 Annex D            30.6 mW     4.5 mW
Table 2 shows the measured results of the chip implementation. Implemented in UMC® 0.18 µm 1P6M CMOS technology, the proposed multi-mode FEC decoder works at 83 MHz (600 Mbps) and costs 54.5K logic gates, two 376x8-bit embedded dual-port SRAMs, and 65032 bytes of external memory for the de-interleaver, of which only 8 bytes are overhead. In fact, 7 MHz is already enough to meet the specification. The chip size is 1892 x 1892 µm², and the floor plan of the chip is shown in figure 4.5. The maximum power consumption, for J.83B in 256QAM with a 1.8 V supply, is 45 mW at 83 MHz (5.4 mW at 7 MHz). Table 2 lists the power consumption of the other modes. These figures show that the chip has a low power requirement.
[Figure 4.5 floor plan. The chip contains the Universal Convolutional Deinterleaver, the RS Decoder with two 376x8 SRAMs, the Scrambler, and the Trellis Decoder.]
Figure 4.5: The floor plan of the chip
The gate count of each module is listed in table 3, where the trellis decoder contains two Viterbi decoders and the synchronization circuit for the FEC frame [1]. Table 3 also lists the logic gate count of an RS decoder for ITU-T J.83D alone, which is the most complex RS code in ITU-T J.83. The proposed multi-mode RS decoder is only about 1.1K gates larger than the J.83D decoder. In other words, the multi-mode RS decoder has an overhead of only 6% compared with the most critical mode.
Table 3: Gate count for each module
Module                                          Logic gate count
Multi-mode RS decoder                           19051
Universal deinterleaver                         8306
Viterbi decoder                                 9883
Trellis decoder (contains 2 Viterbi decoders)   24632
Scrambler                                       1190
Overall FEC decoder                             54542
J.83D RS decoder                                17963
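The overhead quoted for the multi-mode RS decoder follows directly from the two RS decoder rows of table 3:

```latex
\[
19051 - 17963 = 1088 \approx 1.1\text{K gates},
\qquad
\frac{1088}{17963} \approx 0.061 \approx 6\%.
\]
```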
Table 4 compares the proposed multi-mode RS decoder with other reference works. Although [10], [21], and [29] each support only one mode, their gate counts or throughput rates are no better than those of the proposed work. The proposed memory-based universal convolutional deinterleaver also compares well with other works. The (12, 17) convolutional deinterleaver in [10] requires 1280 bytes of memory, built from two 128-byte RAMs and four 256-byte RAMs, so its overhead is 158 bytes. In [21], the (15, 17) convolutional deinterleaver needs 1829 bytes, of which 44 bytes are overhead. For the same convolutional deinterleaver, the proposed algorithm and architecture have an overhead of only 17 bytes of memory plus a low-complexity controller. Furthermore, the designs in [21] and [10] can each serve only one standard with the same components, whereas the proposed multi-mode FEC decoder can be used in many standards, such as ITU-T J.83, DVB-T, and ATSC Digital TV. Hence, the proposed architecture has the advantages of low overhead, high throughput rate, and high flexibility for multi-mode design.
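The overhead figures quoted above can be checked against the ideal storage of an (I, J) convolutional deinterleaver, J*I*(I-1)/2 symbols. This section does not state the formula, so treating it as the baseline is an assumption of this sketch, as are the function names and the one-byte-per-symbol convention.

```python
def ideal_bytes(I, J):
    """Ideal delay-line storage in bytes, assuming one byte per symbol."""
    return J * I * (I - 1) // 2


def overhead_bytes(I, J, actual_bytes):
    """Memory used beyond the ideal delay-line storage."""
    return actual_bytes - ideal_bytes(I, J)


ref10 = overhead_bytes(12, 17, 2 * 128 + 4 * 256)  # [10]: two 128-byte and four 256-byte RAMs
ref21 = overhead_bytes(15, 17, 1829)               # [21]: 1829 bytes in total
proposed_total = ideal_bytes(12, 17) + 17          # proposed: 17 bytes of overhead, per the text

print(ref10, ref21, proposed_total)                # 158 44 1139
```

Both published overheads are reproduced exactly, which suggests that they are measured against this ideal figure.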
Table 4: Comparisons between the proposed architecture and other reference works
                                Proposed        [21]                   [10]                   [29]
Technology                      0.18 µm         0.6 µm                 FPGA                   0.25 µm
RS decoder
  Mode                          Multi-mode      Single-mode            Single-mode            Single-mode
  m                             7, 8            8                      8                      8
  t                             3, 8, 10        16                     8                      8
  Gate counts                   19K, 55K
Convolutional deinterleaving
  Mode                          Universal       Single-mode (15, 17)   Single-mode (12, 17)
  Memory overhead               J, 1 ≤ J ≤ 17   44 bytes               158 bytes
Throughput                      600 Mbps, 73 Mbps, 600 Mbps
4.3 Summary
This chapter introduces the chip implementation of the proposed multi-mode FEC decoder. In 0.18 µm 1P6M CMOS technology, the implemented chip shows that the FEC decoder works at 83 MHz (600 Mbps) and costs 54.5K gates and two 376x8-bit embedded dual-port SRAMs. The chip size is 1.89 mm x 1.89 mm. The average power consumption in full-spec mode is about 45 mW at 83 MHz. At 7 MHz, which meets the symbol rate of a cable modem, the power dissipation is 5.4 mW. Compared with other works, the proposed architecture has the advantages of low overhead, high throughput rate, and high flexibility for multi-mode design.
Chapter 5
Conclusion and Future Work
5.1 Conclusion
In this thesis, a solution for designing a multi-mode FEC decoder is proposed. It consists mainly of a multi-mode RS decoder, which supports different finite fields and error correction capabilities and uses memories to store and correct the received data, and a memory-based universal convolutional interleaver/deinterleaver. Both have low overhead and high flexibility for multi-mode FEC design. The multi-mode FEC decoder can be adopted in the J.83 cable modem system, the DVB-T system, and so on.
To design the multi-mode FEC decoder systematically, we started from the system view and first built a high-level simulation platform in Matlab to verify the proposed algorithm and architecture. The J.83 cable system was chosen as the simulated platform because it is the most complex among the communication systems with similar modules, such as DVB-T, J.83, and ATSC Digital TV. We then constructed the hardware architecture at the RTL level in Verilog. Implemented in UMC® 0.18 µm 1P6M CMOS technology, the chip shows that the proposed multi-mode FEC decoder works at 83 MHz (600 Mbps) and costs 54.5K logic gates, two 376x8-bit embedded dual-port SRAMs, and 65032 bytes of external memory for the de-interleaver, of which only 8 bytes are overhead. The chip size is 1.89 mm x 1.89 mm. Compared with other related works, the proposed architecture has the advantages of high throughput rate, low overhead, and high flexibility for multi-mode design, which reduces the design cost.
5.2 Future Work
As mentioned in chapter one, channel coding is a key module for minimizing the effect of channel noise during data transmission, especially in wireless communications. For future wireless communications, designing an error control code that approaches the Shannon bound becomes more and more important. Concatenated codes such as the FEC in J.83 will not meet the requirements of future wireless communications.
Thus, iterative decoding algorithms [22] are used to approach the Shannon limit more closely.
Both turbo codes [26] and LDPC (low-density parity-check) codes [23] adopt the idea of iterative decoding. Turbo codes were proposed in 1993. A turbo encoder usually comprises the parallel concatenation of two RSC (recursive systematic convolutional) encoders and one interleaver, as shown in figure 5.1, where π denotes the interleaver and x0, x1, and x2 are the encoded data. As the interleaver block length increases, the performance of turbo codes gets closer to the Shannon bound. Turbo codes have been adopted in third-generation mobile systems such as 3GPP and 3GPP2, since a CDMA system needs powerful error correction codes to increase the channel capacity.
[Figure 5.1 block diagram: input u, interleaver π, two RSC encoders, and outputs x0, x1, and x2.]
Figure 5.1: Turbo encoder
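The structure in figure 5.1 can be sketched as follows. The text does not give the constituent code or the interleaver, so the 4-state RSC polynomials (feedback 1 + D + D^2, feedforward 1 + D^2), the permutation passed as `perm`, and the assumption that x0 is the systematic copy of u are illustrative choices, not the parameters of any particular standard.

```python
def rsc_parity(bits):
    """Parity stream of an assumed 4-state RSC encoder.

    Feedback polynomial 1 + D + D^2, feedforward polynomial 1 + D^2.
    """
    s1 = s2 = 0                 # shift-register contents (s1 is the newest)
    out = []
    for u in bits:
        a = u ^ s1 ^ s2         # recursive (feedback) sum
        out.append(a ^ s2)      # feedforward taps at 1 and D^2
        s1, s2 = a, s1
    return out


def turbo_encode(u, perm):
    """Parallel concatenation of two RSC encoders; perm is the interleaver."""
    x0 = list(u)                               # systematic output
    x1 = rsc_parity(u)                         # parity of the first RSC encoder
    x2 = rsc_parity([u[p] for p in perm])      # parity of the interleaved input
    return x0, x1, x2
```

For example, `turbo_encode([1, 0, 0, 0], [3, 2, 1, 0])` yields three streams of four bits each, so the overall code rate of this unpunctured sketch is 1/3.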
LDPC codes, on the other hand, were invented by Gallager in 1962 and rediscovered in 1995 and 1996 [24][25]. LDPC codes are block codes whose parity check matrix is sparse compared with that of traditional block codes. As with turbo codes, the iterative decoding algorithm gets closer to the Shannon limit as the block length increases. The advantages of LDPC codes over turbo codes include:
(1) They do not require a long interleaver.
(2) They have better block error performance.
(3) Their error floor occurs at a much lower BER (Bit Error Rate).
(4) Their decoding is not trellis based, so its parallel nature suits high throughput rate requirements.
Because of these advantages, next-generation wireless communication systems that require high reliability and high throughput rate, such as UWB (ultra-wideband) and DVB-S2, are considering LDPC codes instead of turbo codes as their error control codes. The decoding algorithm and the performance of LDPC codes are described in Appendix-A.
Bibliography
[1] ITU-T, Telecommunication Standardization Sector of ITU, “Digital multi-programme systems for television sound and data services for cable distribution,” Digital transmission of television signals, ITU-T Recommendation J.83, Apr. 1997.
[2] ETSI, “Digital Video Broadcasting (DVB); Framing structure, channel coding and modulation for digital terrestrial television,” EN 300 744 V1.1.2, Nov. 1998.
[3] ATSC Digital Television Standard, Sep. 1995.
[4] H. C. Chang, C. B. Shung, and C. Y. Lee, “A Reed-Solomon Product-Code (RS-PC) Decoder Chip for DVD Applications,” IEEE J. Solid-State Circuits, vol. 36, no. 2, pp. 229-238, Feb. 2001.
[5] J. L. Ramsey, “Realization of Optimum Interleavers,” IEEE Trans. on Inform. Theory, vol. IT-16, no. 3, May 1970.
[6] Y. X. You, J. X. Wang, and X. R. Piao, “Design and Implementation of Concatenated Encoder,” in Int. Conf. ASIC, Oct. 2001.
[7] H. Yang, Y. Zhong, and L. Yang, “An FPGA Prototype of A Forward Error Correction (FEC) Decoder For ATSC Digital TV,” IEEE Trans. on Consumer Electron., vol. 45, no. 2, pp. 387-395, May 1999.
[8] G. D. Forney, Jr., “Burst-Correcting Codes for the Classic Bursty Channel,” IEEE Trans. on Communications, vol. 19, no. 5, pp. 772-781, Oct. 1971.
[9] H. C. Chang, C. C. Lin, and C. Y. Lee, “A Low-Power Reed-Solomon Decoder For STM-16 Optical Communications,” in IEEE Asia-Pacific Conf. ASIC, Aug. 2002.
[10] J. B. Kim, Y. J. Lim, and M. H. Lee, “A Low Complexity FEC Design for DAB,” in ISCAS, May 2001.
[11] R. J. McEliece, The Theory of Information and Coding, 2nd ed. Cambridge, UK: Cambridge University Press, 2002.
[12] S. Lin and D. J. Costello, Jr., Error Control Coding, Fundamentals and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1983.
[13] J. B. Cain, G. C. Clark, and J. M. Geist, “Punctured convolutional codes of rate (n-1)/n and simplified maximum likelihood decoding,” IEEE Trans. on Inform. Theory, vol. IT-25, no. 1, pp. 97-101, Jan. 1979.
[14] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. on Inform. Theory, vol. IT-13, pp. 260-269, Apr. 1967.
[15] G. D. Forney, Jr., “Convolutional Code II: Maximum likelihood decoding,” Information and Control, vol. 25, pp. 222-266, July 1974.
[16] Dalia A. F. El-Dib and Mohamed I. Elmasry, “Low-Power Register-Exchange Viterbi Decoder For High-Speed Wireless Communications,” IEEE International Symposium on Circuits and Systems, vol. 5, pp. 737-740, 2002.
[17] S. R. Meier, M. Steinert, and S. Buch, “Testability of Path History Memories with Register-Exchange Architecture Used in Viterbi-Decoders,” IEEE International Symposium on Circuits and Systems, vol. 3, pp. 165-168, 2002.
[18] Andries P. Hekstra, “An Alternative to Metric Rescaling in Viterbi Decoders,” IEEE Trans. on Communications, vol. 37, no. 11, pp. 1220-1222, Nov. 1989.
[19] Richard E. Blahut, Theory and Practice of Error Control Codes. Addison-Wesley Publishing Company, 1983.
[20] Gennady Feygin and P. G. Gulak, “Architectural Tradeoffs for Survivor Sequence Memory Management in Viterbi Decoders,” IEEE Trans. on Communications, vol. 41, no. 3, pp. 425-429, Mar. 1993.
[21] Daniel A. Luthi, Advait Mogre, Nadav Ben-Efraim, and Alok Gupta, “A single-chip concatenated FEC decoder,” IEEE Custom Integrated Circuits Conference, pp. 285-288, May 1995.
[22] Joachim Hagenauer, Elke Offer, and Lutz Papke, “Iterative Decoding of Binary Block and Convolutional Codes,” IEEE Trans. Inform. Theory, vol. 42, no. 2, pp. 429-445, Mar. 1996.
[23] R. G. Gallager, “Low Density Parity Check Codes,” IRE Trans. Inform. Theory, vol. IT-8, pp. 21-28, Jan. 1962.
[24] Niclas Wiberg, Hans-Andrea Loeliger, and Ralph Kotter, “Codes and Iterative Decoding on General Graphs,” IEEE International Symposium on Information Theory, p. 468, Sept. 1995.
[25] D. J. C. MacKay and R. M. Neal, “Near Shannon limit performance of low density parity check codes,” IEE Electronics Letters, vol. 32, no. 18, p. 1645, Aug. 1996.
[26] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon Limit Error-Correcting Coding and Decoding: turbo-codes,” IEEE Int. Conf. Communications (ICC), pp. 1064-1070, May 1993.
[27] D. J. C. MacKay, “Good error-correcting codes based on very sparse matrices,” IEEE Trans. Inform. Theory, vol. 45, pp. 399-431, Mar. 1999.
[28] Xiao-Yu Hu, Evangelos Eleftheriou, Dieter-Michael Arnold, and Ajay Dholakia, “Efficient Implementations of the Sum-Product Algorithm for Decoding LDPC Codes,” IEEE Global Telecommunications Conference, vol. 2, pp. 1036-1041, 25-29 Nov. 2001.
[29] H. Lee, M. L. Yu, and L. Song, “VLSI Design of Reed-Solomon Decoder Architecture,” IEEE ISCAS, May 2000.
Appendix-A
Decoding algorithm of LDPC codes
The decoding algorithm of LDPC codes is called the iterative sum-product algorithm (SPA); it is also known as the message passing (MP) algorithm or the belief propagation (BP) algorithm. The behavior of the MP algorithm for LDPC codes can be expressed on a bipartite graph of the parity check matrix H, as shown in figure A.1. The variable nodes and the function nodes pass messages to each other iteratively.
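[Figure A.1 bipartite graph. The variable nodes X1 to X7 are connected to the function nodes fA, fB, and fC according to the parity check matrix H.]
\[
H = \begin{bmatrix}
1 & 1 & 1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 1
\end{bmatrix}
\]
Figure A.1: The message passing on bipartite graph of LDPC codes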
To explain how the SPA decodes LDPC codes, we first define some notation for the parity check matrix H. We denote the set of bits n that participate in check m by $N(m) \equiv \{n : H_{mn} = 1\}$. Similarly, we denote the set of checks m in which bit n participates by $M(n) \equiv \{m : H_{mn} = 1\}$.
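As a small illustration of this notation, the two families of sets can be listed for the 3 x 7 matrix H of figure A.1, read row by row. The 0-based indices and the function names are assumptions of this sketch.

```python
# Parity check matrix of figure A.1, read row by row (rows = checks, columns = bits).
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def bits_in_check(H):
    """N(m): for every check m, the set of bits n with H[m][n] = 1."""
    return [{n for n, h in enumerate(row) if h} for row in H]


def checks_on_bit(H):
    """M(n): for every bit n, the set of checks m with H[m][n] = 1."""
    return [{m for m, row in enumerate(H) if row[n]} for n in range(len(H[0]))]


N = bits_in_check(H)
M = checks_on_bit(H)
```

With 0-based indexing, `N[0]` is `{0, 1, 2, 4}` and `M[0]` is `{0, 1, 2}`: the first check involves four bits, and the first bit takes part in all three checks. Each pair (m, n) with H[m][n] = 1 corresponds to one edge of the bipartite graph.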
We denote the set $N(m)$ with bit $n$ excluded by $N(m) \setminus n$. The algorithm has two parts, in which quantities $q_{n \to m}$ and $r_{m \to n}$ associated with each nonzero element in the H matrix are