
Chapter 4 The High Speed Turbo Decoder Design II

4.5 Chip Implementation

The proposed 1Gbps turbo decoder uses a code rate of 1/3 without any puncturing, which is lower than the code rate of our previous design, and the implementation applies the max-log-MAP algorithm with a scaling factor of 0.75. The detailed specifications are listed in Table 4.1, and the post-layout view is shown in Fig. 4.13; the chip has 208 pins. The performance of this design is the same as that described in chapter 3. Power management is handled more carefully in this design because of the failure of the delay-locked loop (DLL) circuit. Fig. 4.12 shows the power isolation technique that we applied to the DLL module. The power supply of the DLL is isolated from the other circuits, so the power noise of the rest of the chip does not affect the DLL.

Furthermore, the internal clock is more stable and the uncertainty of the internal clock tree is smaller, because the isolation provides a stable working environment for the DLL.

Fig. 4.12 Power isolation of DLL

Fig. 4.13 Chip layout view

Table 4.2 Summary of the proposed 1Gbps turbo decoder

Parameter        Value
Interleaver      IBP interleaver, (p, s) = (15, 23)
Sliding window   32
Code rate        1/3
Block length     4096 (128 x 32)
Quantization     6 bits (3.3)
Iterations       8
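The arithmetic behind two of the specifications above, the max-log-MAP approximation with a 0.75 scaling factor and the 6-bit quantization, can be sketched as follows. This is a minimal illustration, not the chip's datapath. It assumes that "(3.3)" means 3 integer bits (sign included) plus 3 fractional bits, and all function names are hypothetical.

```python
import math

SCALE = 0.75      # extrinsic scaling factor quoted in the text
TOTAL_BITS = 6    # quantization width quoted in the text
FRAC_BITS = 3     # assumption: "(3.3)" = 3 integer bits (sign included) + 3 fractional bits


def max_star_exact(a, b):
    """Jacobian logarithm ln(e^a + e^b), as used by log-MAP."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_approx(a, b):
    """Max-log-MAP keeps only the maximum and drops the correction term."""
    return max(a, b)


def scale_extrinsic(extrinsic):
    """Scale an extrinsic value before it is passed to the other SISO decoder."""
    return SCALE * extrinsic


def quantize(value):
    """Saturating fixed-point quantizer; returns the represented real value."""
    step = 2.0 ** -FRAC_BITS
    lowest = -(2 ** (TOTAL_BITS - 1))
    highest = 2 ** (TOTAL_BITS - 1) - 1
    code = max(lowest, min(highest, round(value / step)))
    return code * step
```

Under the assumed format, the representable range is -4.0 to 3.875 in steps of 0.125, and values outside that range saturate.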

4.6 Summary

The proposed 1Gbps turbo decoder is the first turbo decoder chip to achieve 1Gbps throughput. We modified the utilization of the processing elements and made the decoding schedule more efficient, which improves the throughput by 50%. The interleaver implementation is more flexible than in the previous design, and the proposed decoder can support multiple code lengths. Compared with our previous design, the energy per bit per iteration improves from 0.22 nJ to 0.144 nJ, which is attributed to the radix-4x4 ACS structure and the more advanced process. To our knowledge, the proposed design is the fastest and most energy-efficient turbo decoder among state-of-the-art designs.
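As a quick check of the energy figures quoted in this summary, the short sketch below computes the relative reduction and the energy per decoded bit. It assumes that the per-iteration figure applies uniformly to each of the 8 iterations listed in the specification table; the function names are illustrative only.

```python
ITERATIONS = 8        # iteration count from the specification table
PREVIOUS_NJ = 0.22    # nJ per bit per iteration, previous design
PROPOSED_NJ = 0.144   # nJ per bit per iteration, proposed design


def relative_reduction(old, new):
    """Fraction by which the energy figure shrinks from old to new."""
    return (old - new) / old


def energy_per_decoded_bit(nj_per_bit_per_iteration, iterations):
    """Assumption: every iteration costs the same energy per bit."""
    return nj_per_bit_per_iteration * iterations


reduction = relative_reduction(PREVIOUS_NJ, PROPOSED_NJ)
per_bit = energy_per_decoded_bit(PROPOSED_NJ, ITERATIONS)
print(f"reduction: {reduction:.1%}")
print(f"energy per decoded bit: {per_bit:.3f} nJ")
```

The reduction works out to roughly 34.5%, and under the stated assumption a fully decoded bit costs about 1.152 nJ.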

Chapter 5 Highly Parallel Decoding of Turbo code

In parallel decoding of turbo codes, three issues deserve particular consideration:

• The throughput of each decoder (processing element)

• The utilization of each decoder

• Parallelism (number of decoders)

In previous chapters, we proposed methods to improve the throughput of each decoder and to make the decoding more efficient. In this chapter, we put the emphasis on parallelism. In chapter 3 and chapter 4, we broke the forward and backward recursions for parallelism by partitioning a whole codeword into many sub-codewords. However, the length of a sub-codeword is limited by the distance property at short block lengths: the distance property of a component code with constraint length 4 degrades when the length falls below 100. That is why we chose a sub-codeword length of 128. Fig. 5.1 shows the architecture described in chapter 4, in which the decoder with 16 processing elements achieves 1Gbps. The sub-codeword length is fixed at 128. If we would like to apply more processing elements in the design, we have to find new approaches to parallelism. The problem becomes “How to decode one sub-codeword with multiple decoders?” In the following sections, we discuss the approach of [17] and [18] for parallel decoding and present our new architectures.

Fig. 5.1 The architecture of 1Gbps turbo decoder

5.1 A Sectionalized Method for Parallel Decoding

In a trellis-based turbo decoder, the forward and backward recursions tie the symbols together. The path metric values inherited from the previous trellis stage make parallel decoding of a codeword difficult. Even with the sliding window approach in Fig. 5.2, we still suffer from the dependency of the forward recursion.

Fig. 5.2 Decoding procedure of sliding window approach

5.1.1 A sectionalized method

The solution in [17] and [18] to the inheritance of initial values is to store, during the current iteration, the initial values that will be needed, and to apply them in the next iteration. Fig. 5.3 shows the detailed procedure of the sectionalized method. In the first iteration, the initial values are not yet available, so they are set to zero. During the first iteration, the needed values are calculated and stored in the memories. From the second iteration to the last one, the initial values are therefore read from the memories. Fig. 5.4 shows different sizes of the sectionalized trellis. A codeword of total block length N can be sectionalized into ‘fixed-length’ parts. 4T in Fig. 5.4 means that each section consists of four trellis stages, that is, 4 symbols. The 8T and 16T cases are defined in the same way. With the same sub-codeword length, the smaller the sections, the higher the parallelism. Note that the larger N is, the more storage is required. The storage of initial values covers both α and β, and the α and β initial values of different decoding rounds must be stored separately. For example, with a block length of 64, 8 states, 6-bit quantization, and the 4T case, the initial storage takes 768 bits in total.
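The storage example can be checked with a one-line product. The sketch below rests on an assumption: the product (N / T) x states x bits reproduces the 768 bits quoted above, but the text does not spell out how α, β, and the separate decoding rounds are counted within that figure, so `metric_sets` is a hypothetical multiplier left at 1 by default.

```python
def initial_storage_bits(block_len, section_len, states, bits, metric_sets=1):
    """Bits needed to hold one state-metric vector per section boundary.

    metric_sets is a hypothetical multiplier (for example alpha and beta,
    or separate decoding rounds); its default of 1 matches the quoted figure.
    """
    if block_len % section_len:
        raise ValueError("section length must divide the block length")
    sections = block_len // section_len
    return sections * states * bits * metric_sets


print(initial_storage_bits(64, 4, 8, 6))    # 4T case from the text
print(initial_storage_bits(64, 8, 8, 6))    # 8T case, same block
print(initial_storage_bits(64, 16, 8, 6))   # 16T case, same block
```

Halving the number of sections halves the storage, which is the trade-off against parallelism described above.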

Fig. 5.3 A sectionalized method

Fig. 5.4 Different sizes of the sectionalized method

5.1.2 Parallel decoding with the sectionalized method

Once the recursion dependency is broken by storing initial values, parallel decoding of a codeword becomes simple. Fig. 5.5 compares the sliding window approach with the sectionalized method. The sliding window approach calculates a dummy β to initialize the real β calculation. Moreover, because of the forward recursion, it cannot use multiple decoders to decode concurrently. The sectionalized method, on the other hand, decodes concurrently by reading the stored initial values for α and β, and it saves the calculation time and hardware of the dummy β required by the sliding window approach. The trade-offs are discussed in the following sections.
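The way stored initial values break the forward recursion can be illustrated with a toy max-log α recursion. This is a sketch under simplifying assumptions, not the decoder described here: the branch metrics are random integers held fixed across iterations (in a real turbo decoder they change as the extrinsic values are updated), the trellis is fully connected, and only α is shown because β is symmetric. All names are hypothetical.

```python
import random


def forward_step(alpha, gamma_k):
    """One max-log forward step: new[t] = max over s of alpha[s] + gamma_k[s][t]."""
    n = len(alpha)
    return [max(alpha[s] + gamma_k[s][t] for s in range(n)) for t in range(n)]


def full_forward_boundaries(gamma, alpha0, section_len):
    """Reference: one unbroken recursion, sampled at the end of every section."""
    alpha, ends = list(alpha0), []
    for k, gamma_k in enumerate(gamma, start=1):
        alpha = forward_step(alpha, gamma_k)
        if k % section_len == 0:
            ends.append(alpha)
    return ends


def sectionalized_forward(gamma, alpha0, section_len, iterations):
    """Each section starts from the boundary metric stored in the previous iteration."""
    n_states = len(alpha0)
    n_sections = len(gamma) // section_len
    # Section 0 knows the true start; the others start from zero in iteration 1.
    stored = [list(alpha0)] + [[0] * n_states for _ in range(n_sections - 1)]
    ends = []
    for _ in range(iterations):
        ends = []
        for sec in range(n_sections):  # no dependency between sections: parallelizable
            alpha = stored[sec]
            for k in range(sec * section_len, (sec + 1) * section_len):
                alpha = forward_step(alpha, gamma[k])
            ends.append(alpha)
        for sec in range(1, n_sections):  # save boundary metrics for the next iteration
            stored[sec] = ends[sec - 1]
    return ends


def random_branch_metrics(n_stages, n_states, seed=0):
    """Toy integer branch metrics; a real decoder derives them from channel and extrinsic LLRs."""
    rng = random.Random(seed)
    return [[[rng.randint(-8, 8) for _ in range(n_states)] for _ in range(n_states)]
            for _ in range(n_stages)]
```

With fixed branch metrics, the boundary metrics of the sectionalized recursion match the unbroken recursion once the number of iterations reaches the number of sections, because one more boundary becomes exact in every iteration. The inner loop over sections carries no dependency, which is what allows one decoder per section.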

Fig. 5.5 Comparison of different structures

5.2 Proposed Architectures

The sectionalized method partitions a codeword into several sub-blocks and makes higher-order parallelism possible. It can be applied to our design to partition the sub-codewords into fixed-length sub-blocks. Combining the two methods enables higher-order parallelism and can be regarded as ‘two-dimension’ parallel decoding. The first dimension is ‘inter-codeword’ parallelism, which was introduced and used in chapters 3 and 4: the processing elements decode different sub-codewords at the same time. The second dimension is ‘intra-codeword’ parallelism, in which the processing elements decode the same sub-codeword concurrently. The combined method makes all kinds of parallel structures possible under the contention-free constraint for memory-based designs.

5.2.1 A two-dimension parallel architecture

Fig. 5.6 shows a two-dimension parallel method, which can be regarded as a fully parallel decoder. The architecture suits highly parallel applications. The contention-free constraint on the interleaver design is much more complicated in this case: the contention constraint must hold in both dimensions, and interleavers that meet it are few and hard to find. The IBP interleaver described in chapter 3 consists of two-stage permutations and is contention-free in both dimensions, so it can be applied in this architecture.

Fig. 5.6 A two-dimension parallel method

5.2.2 An intra-codeword parallel architecture

A reduced architecture, called the ‘intra-codeword parallel architecture’, applies parallelism in the sub-codeword dimension only. In this architecture, we only have to consider the contention problem within one sub-codeword, which simplifies the design. The architecture decodes one sub-codeword at a time with multiple processing elements. The sub-codewords take turns entering the decoder and are then written back to the memories.

Fig. 5.7 An intra-codeword parallel architecture

5.2.3 Data hazards

The data hazards caused by iterative decoding, mentioned in section 4.1, stall the decoding procedure and degrade the utilization. The intra-codeword parallel architecture provides a simple and efficient way to remove them. Fig. 5.8 shows that the last few sub-codewords of the pre-decoding and post-decoding rounds in every iteration cause data hazards. The solution is to arrange a proper decoding order for the sub-codewords. In Fig. 5.9, if we decode the first two sub-codewords of the pre-decoding round first, the first sub-codeword of the post-decoding round can be decoded while the last few sub-codewords are being decoded. Because the extrinsic values needed by the first sub-codeword of the post-decoding round have already been calculated and stored in the memories, we can decode it without any data hazards. A proper decoding order therefore avoids the data hazards without any hardware overhead. However, if the decoding latency is long, the number of sub-codewords must be large enough to allow such an arrangement.

Fig. 5.8 Data hazards

Fig. 5.9 A proper decoding order for data hazards

5.3 Performance Analysis

The sectionalized method partitions a codeword by storing initial values. Fig. 5.10 to Fig. 5.14 show the performance of different section sizes. The 64T and 32T cases show almost no performance loss at the same number of iterations; the loss in each case is less than 0.01dB. From 16T down to 4T, however, the performance gets progressively worse: the BER convergence of a smaller section is worse than that of a larger one.


5.4 Proposed Method to Improve Performance

Since the performance degrades and diverges with small section sizes, we propose a method to improve it. First, we need to understand how the performance degrades. Two factors may cause the degradation: the first is the zero initial values in the first iteration, and the second is the initial values inherited from the previous iteration. The proposed method extends the path metric recursion over more trellis stages, which means that more trellis stages are accumulated within the current iteration. Fig. 5.15(a) shows that if we read the initial values from earlier sections and accumulate a more accurate path metric in the current iteration, the performance improves as long as the path metric is calculated over a long enough span. The longer the accumulation, the better the performance. Fig. 5.16 and Fig. 5.17 show the cases of 8T extended to 16T and of 4T extended to 8T, 12T, and 16T. The effect of the zero initial values can also be seen in Fig. 5.16 and Fig. 5.17. The results do not clearly establish that the zero initial values are the major factor, but it is evident that the extension improves the performance greatly.
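The extension idea can be illustrated with the same kind of toy max-log α recursion used for sectionalized decoding. In this sketch, each section starts its recursion `extend` sections earlier, from the boundary metric stored at that earlier point. The branch metrics are random integers held fixed across iterations, which is a simplifying assumption: the example shows how quickly the boundary metrics agree with an unbroken recursion, not the BER behaviour reported in the figures. All names are hypothetical.

```python
import random


def forward_step(alpha, gamma_k):
    """One max-log forward step: new[t] = max over s of alpha[s] + gamma_k[s][t]."""
    n = len(alpha)
    return [max(alpha[s] + gamma_k[s][t] for s in range(n)) for t in range(n)]


def full_forward_boundaries(gamma, alpha0, section_len):
    """Reference: one unbroken recursion, sampled at the end of every section."""
    alpha, ends = list(alpha0), []
    for k, gamma_k in enumerate(gamma, start=1):
        alpha = forward_step(alpha, gamma_k)
        if k % section_len == 0:
            ends.append(alpha)
    return ends


def extended_forward(gamma, alpha0, section_len, iterations, extend):
    """Sectionalized recursion in which each section warms up over `extend` earlier sections."""
    n_states = len(alpha0)
    n_sections = len(gamma) // section_len
    stored = [list(alpha0)] + [[0] * n_states for _ in range(n_sections - 1)]
    ends = []
    for _ in range(iterations):
        ends = []
        for sec in range(n_sections):
            first = max(0, sec - extend)       # start from an earlier stored boundary
            alpha = stored[first]
            for k in range(first * section_len, (sec + 1) * section_len):
                alpha = forward_step(alpha, gamma[k])
            ends.append(alpha)
        for sec in range(1, n_sections):       # save boundary metrics for the next iteration
            stored[sec] = ends[sec - 1]
    return ends


def iterations_bound(n_sections, extend):
    """Iterations after which every boundary is exact when branch metrics are fixed."""
    return -(-n_sections // (extend + 1))      # ceiling division


def random_branch_metrics(n_stages, n_states, seed=0):
    """Toy integer branch metrics; a real decoder derives them from channel and extrinsic LLRs."""
    rng = random.Random(seed)
    return [[[rng.randint(-8, 8) for _ in range(n_states)] for _ in range(n_states)]
            for _ in range(n_stages)]
```

In this toy setting with eight 4T sections, the boundaries are exact after at most 8 iterations without extension, 4 iterations when 4T is extended to 8T (`extend=1`), and 2 iterations when 4T is extended to 16T (`extend=3`). The price is a longer recursion per section, which matches the trade-off between accumulation length and cost described above.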

Fig. 5.15 A proposed method to improve performance

Fig. 5.16 8T extend to 16T (BER versus Eb/No in dB; curves: 3GPP SW=32 with 8 iterations; 3GPP SW=32 with 12 iterations; 8T extended to 16T with 8 iterations; given initial values at the first iteration; original 8T with 8 iterations)


As mentioned in section 5.1, with a fixed block length, the smaller the sections, the more decoders we have. In this section, we apply more decoders to obtain a higher throughput. Fig. 5.18(a) shows the notation of the decoding schedule. The circles and the squares denote the initial storage of α and β. The decoding schedules in Fig. 5.18(b), (c), and (d) are the original 4T, 8T, and 16T cases, respectively. With a proper decoding schedule, the number of decoders can be doubled for higher throughput. Comparing Fig. 5.18(c) with Fig. 5.19(b), the number of decoders is doubled without any overhead, and the decoding latency is shortened. Moreover, the number of decoders in Fig. 5.19(b) equals that of the 4T case, which means that we achieve the throughput of the 4T case with the overhead of the 8T case. The number of decoders in the 4T case cannot be doubled because the design is based on the radix-4x4 structure: as Fig. 5.18(a) shows, one step on the vertical axis corresponds to reading 4 symbols per cycle. Fig. 5.20 shows the decoding schedule of the extended version.
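The decoder counts compared in this section can be tabulated with a small helper. The sketch assumes one decoder per section in the original schedule and a sub-codeword length of 128, as used in chapter 4. The rule that doubling needs at least two radix-4x4 steps per section is an interpretation of the remark about the 4T case, not a statement from the figures.

```python
SYMBOLS_PER_CYCLE = 4  # radix-4x4 reads 4 symbols per cycle (from the text)


def decoders_per_subcodeword(subcodeword_len, section_len, doubled=False):
    """Number of decoders working on one sub-codeword."""
    if subcodeword_len % section_len:
        raise ValueError("section length must divide the sub-codeword length")
    decoders = subcodeword_len // section_len  # assumption: one decoder per section
    if doubled:
        # Assumption: a section must span at least two radix-4x4 steps to be shared.
        if section_len < 2 * SYMBOLS_PER_CYCLE:
            raise ValueError("a 4T section is a single radix-4x4 step and cannot be doubled")
        decoders *= 2
    return decoders


for section_len in (4, 8, 16):
    print(f"{section_len}T original:", decoders_per_subcodeword(128, section_len))
for section_len in (8, 16):
    print(f"{section_len}T doubled:", decoders_per_subcodeword(128, section_len, doubled=True))
```

Under these assumptions the doubled 8T schedule uses as many decoders as the original 4T schedule while keeping the smaller initial-value storage of the 8T case.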

Fig. 5.18 The original decoding schedule

Fig. 5.19 Example of the new 8T and 16T decoding schedule

Fig. 5.20 Decoding schedule of extension

5.6 Hardware Comparison

Table 5.1 and Table 5.2 compare the original design with the two proposed architectures. The storage difference depends on the parameters N, n, n0, and M. The most important parameter is M, because a greater M makes the storage reduction larger. In other words, if we apply more processing elements in our design, the total storage may be reduced compared with the original SW case. For a medium M, there is some storage overhead. However, the reduction in ACS units clearly saves area and gate count.

Table 5.1 Hardware comparison of two-dimension parallel architecture

Table 5.2 Hardware comparison of intra-codeword parallel architecture

5.7 Summary

In this chapter, we modified the concept in [18] and combined it with our original design to create a new two-dimension parallel structure. The performance with different section sizes was analyzed for different applications. A method to improve the performance convergence was proposed at reasonable hardware cost. Two parallel architectures were proposed for different design constraints, and a modified hazard-free method was discussed in section 5.2.3. A double-throughput scheduling method was proposed for high parallelism.

Meanwhile, the parametric hardware comparisons are listed in Table 5.1 and Table 5.2 with examples, and they can be reviewed quickly before a design starts. This chapter facilitates ultra-high-speed turbo decoder design and completes the treatment of parallel decoding.

Chapter 6 Conclusion and Future Work

6.1 Conclusion

In this thesis, we proposed two turbo decoders with a parallel architecture that enables multiple processing elements to decode one codeword concurrently. The proposed IBP interleaver connects all processing elements with an easily implemented structure and avoids the limit of the forward and backward recursions.

In the first design, we also introduced a high-speed methodology for the high-radix decoder structure with a matching contention-free IBP interleaver. The combination of the two-stage ACS and the retiming technique efficiently speeds up the decoding throughput at acceptable hardware cost. The energy per bit of the proposed turbo decoder is much lower than that of state-of-the-art designs.

In the second, 1Gbps design, we modified the utilization of the processing elements and made the decoding schedule more efficient with doubled throughput. The interleaver implementation is more flexible than in the previous design, and the proposed 1Gbps turbo decoder can support multiple code lengths. The proposed 1Gbps turbo decoder is the fastest and most power-efficient turbo decoder chip among state-of-the-art designs, achieving 1Gbps throughput.

In chapter 5, we proposed a combined method that makes the parallelism work in two dimensions. The performance and the hardware cost under different conditions were analyzed, and a new extension method and a new scheduling method were proposed to improve the performance and the throughput.

6.2 Future Work

Up to now, early termination has been regarded as the most efficient way to reduce the power consumption of turbo decoders. It uses several characteristics of turbo decoding to judge whether the decoded sequence is nearly correct before the maximum number of iterations is reached. If the iterative decoding can be stopped earlier, power is saved. In [37], an iteration stopping criterion was modified based on the cross entropy between the a posteriori probabilities of the two SISO decoders in each iteration. Other simplified criteria were proposed in [38] and [39]. Most of these criteria save power by leaving the decoder idle. The idea of utilization discussed in chapter 4 is useful for devising a new stopping criterion, which is more precisely called a “skipping criterion.” If we set a “skipping criterion” for all sub-codewords in the proposed intra-codeword parallel architecture, the decoder skips the sub-codewords that meet the criterion. The decoding procedure therefore becomes more efficient, and the throughput increases as the channel condition improves.
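One possible shape of such a per-sub-codeword check is sketched below. It is purely hypothetical and is not one of the criteria in [37] to [39]: a sub-codeword is skipped when its hard decisions did not change between two consecutive iterations and its smallest LLR magnitude exceeds a threshold. The threshold value and all names are assumptions made for illustration.

```python
def meets_skipping_criterion(previous_llrs, current_llrs, min_reliability=4.0):
    """Hypothetical test: hard decisions are stable and every LLR is reliable."""
    signs_stable = all((p >= 0) == (c >= 0)
                       for p, c in zip(previous_llrs, current_llrs))
    reliable = min(abs(c) for c in current_llrs) >= min_reliability
    return signs_stable and reliable


def subcodewords_to_decode(previous, current, min_reliability=4.0):
    """Indices of the sub-codewords that still need decoding in the next round."""
    return [index
            for index, (p, c) in enumerate(zip(previous, current))
            if not meets_skipping_criterion(p, c, min_reliability)]


# Two toy sub-codewords of two LLRs each, before and after one iteration.
previous = [[5.0, -6.0], [1.0, -0.5]]
current = [[7.0, -8.0], [-0.2, -3.0]]
print(subcodewords_to_decode(previous, current))
```

In this toy input the first sub-codeword is stable and reliable, so only the second one stays in the decoding schedule; the freed decoder time is what would raise the throughput as the channel improves.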

Bibliography

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: Turbo-codes (1),” in Proc. IEEE Int. Conf. on Commun., Geneva, Switzerland, May 1993, pp. 1064–1070.

[2] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” IEEE Trans. Inform. Theory, vol. IT-20, pp. 284–287, Mar. 1974.

[3] J. Hagenauer, E. Offer, and L. Papke, “Iterative decoding of binary block and convolutional codes,” IEEE Trans. Inform. Theory, vol. 42, no. 2, pp. 429-445, Mar. 1996.

[4] J. Hagenauer and P. Hoeher, “A Viterbi Algorithm with Soft-decision Outputs and its Applications,” in IEEE GLOBE-COM, Dallas, TX, pp. 47.1.1-47.1.7, Nov. 1989.

[5] G. Solomon and H. C. A. van Tilborg, “A connection between block and convolutional codes,” SIAM J. Appl. Math., vol. 37, pp. 358–369, Oct. 1979.

[6] H. H. Ma and J. K. Wolf, “On tail biting convolutional codes,” IEEE Trans. Commun., vol. COM-34, pp. 104–111, Feb. 1986.

[7] C. Weiss, C. Bettstetter, S. Riedel, and D. J. Costello, “Turbo decoding with tailbiting trellises,” in Proc. URSI Int. Symp. Signals, Systems, Electronics, 1998, pp. 343–348.

[8] C. Weiss, C. Bettstetter, and S. Riedel, “Code construction and decoding of parallel concatenated tail-biting codes,” IEEE Trans. Inform. Theory, vol. 47, pp. 366-386, Jan. 2001.

[9] J. Sun and O. Y. Takeshita, “Interleavers for Turbo codes using permutation polynomials over integer rings,” IEEE Trans. Inform. Theory, vol. 51, no. 1, pp. 101-119, Jan. 2005.

[10] P. Robertson, E.Villebrun and P. Hoeher, “A Comparison of Optimal and Sub-optimal MAP Decoding Algorithms operating in the Log Domain,” Proc. ICC’95, Seattle, June 1995.

[11] M. Bickerstaff, L. Davis, C. Thomas, D. Garrett, and C. Nicol, “A 24Mb/s radix-4 logMAP turbo decoder for 3GPP-HSDPA mobile wireless,” in ISSCC Dig. Tech. Papers, 2003, pp. 151–484.

[12] B. Bougard, A. Giulietti, V. Derudder, J. Willem, S. Dupont, L. Hollevoet, F. Catthoor, L. V. der Perre, H. D. Man, and R. Lauwereins, “A scalable 8.7nJ/bit 75.6Mb/s parallel concatenated convolutional (turbo-)codec,” in ISSCC Dig. Tech. Papers, 2003, pp. 152–484.

[13] Y. Zheng, “Network for permutation or de-permutation utilized by channel coding algorithm,” U.S. Patent Pending.

[14] C. H. Tang, C. C. Wong, C. L. Chen, C. C. Lin, and H. C. Chang, “A 952Mb/s Max-Log MAP decoder chip using radix-4×4 ACS architecture,” in IEEE A-SSCC, 2006, pp. 79–82.

[15] C. B. Shung, P. H. Siegel, G. Ungerboeck, and H. K. Thapar, “VLSI architectures for metric normalization in the Viterbi algorithm,” in Int. Conf. Communications, vol. 4, Atlanta, GA, Apr. 1990, pp. 1723–1728.

[16] P. Urard, L. Paumier, M. Viollet, E. Lantreibecq, H. Michel, S. Muroor, and B. Gupta, “A generic 350Mb/s turbo-codec based on a 16-states SISO decoder,” in ISSCC Dig. Tech. Papers, 2004, pp. 424–536.

[17] Z. He, P. Fortier, and S. Roy, “Highly parallel decoding architectures for convolutional turbo codes,” IEEE Trans. VLSI Syst., vol. 14, no. 10, Oct. 2006.

[18] S. Yoon, and Y. Bar-Ness, “A Parallel MAP Algorithm for Low Latency Turbo Decoding,” IEEE Commun. Lett., vol. 6, no. 7, pp.288-290, Jul. 2002.

[19] J. H. Andersen, “‘Turbo’ Coding for Deep Space Application,” in IEEE International Symposium on Inform. Theory, 17-22, pp.36, Sep. 1995.

[20] D.Divsalar, S. Dolinar, R. J. McEliece, and F. Pollara, “Performance Analysis of Turbo Codes,” in IEEE Military Communication conf., vol. 1, 5-8, pp. 91-96, Nov. 1995.

[21] J. Hagenauer and P. Hoeher, “A Viterbi Algorithm with Soft-decision Outputs and its Applications,” in IEEE GLOBE-COM, Dallas, TX, pp. 47.1.1-47.1.7, Nov. 1989.

[22] J. H. Andersen, “‘Turbo’ Coding for Deep Space Application,” in IEEE International Symposium on Inform. Theory, 17-22, pp.36, Sep. 1995.

[23] J. H. Andersen, “Turbo codes extended with outer BCH code,” in Electronics Letters, vol. 32, no. 22, 24, pp.2059-2060, Oct. 1996.

[24] J. A. Erfanian, S. Pasupathy, and G.Gulak, “Reduced Complexity Symbol Detectors with Parallel Structures for ISI Channels,” IEEE Trans. Commun., vol. 42, no. 2/3/4, pp.1261-1271, Feb./Mar./Apr. 1994.

[25] T. A. Summers and S. G. Wilson, “SNR Mismatch and Online Estimation in Turbo Decoding,” IEEE Trans. Commun., vol. 46, pp.421-423, Apr. 1998.

[26] A. Worm, P. Hoeher, N. Wehn, “Turbo-Decoding Without SNR Estimation,” IEEE Commun. Letters, vol. 4, no. 6, pp.193-195, June 2000.

[27] S. A. Barbulescu, “Iterative decoding of turbo codes and other concatenated codes,” University of South Australia, PhD Dissertation, Aug. 1995.

[28] S. A. Barbulescu, “On Sliding Window and Interleaver Design,” Electronics Letters, vol. 37, no. 21, pp.1299-1300, Oct. 2001.

[29] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. Inform. Theory, vol. IT-13, no. 2, pp.260-269, Apr. 1967.

[30] Y. Wu and B. D. Woerner, “Internal data width in SISO decoding module with modular renormalization,” in IEEE Vehic. Tech. Conf., vol. 1, pp. 675-679, May 2000.
