A Dual Mode Channel Decoder for 3GPP2 Mobile Wireless Communications


National Chiao Tung University
Department of Electronics Engineering, Institute of Electronics
Master's Thesis

A Dual Mode Channel Decoder for 3GPP2 Mobile Wireless Communications

Student: Yen-Hsu Shih (施彥旭)
Advisor: Prof. Chen-Yi Lee (李鎮宜)

June 2004

A Dual Mode Channel Decoder for 3GPP2 Mobile Wireless Communications

Student: Yen-Hsu Shih (施彥旭)
Advisor: Chen-Yi Lee (李鎮宜)

Master's Thesis, Institute of Electronics, Department of Electronics Engineering, National Chiao Tung University

A Thesis Submitted to Department of Electronics, College of Electrical Engineering and Computer Science, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master of Science in Electronics Engineering

June 2004
Hsinchu, Taiwan, Republic of China

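A Dual Mode Channel Decoder for 3GPP2 Mobile Wireless Communications

Student: Yen-Hsu Shih
Advisor: Prof. Chen-Yi Lee
Institute of Electronics, Department of Electronics Engineering, National Chiao Tung University

摘要 (Chinese Abstract)

In recent years, with the rapid growth of wireless communication applications, forward error correction codes have become increasingly important. Among them, convolutional codes and turbo codes, both of which offer high decoding capability, are widely used. In the third-generation mobile wireless communication standard, for example, both convolutional codes and turbo codes are specified for channel coding. From a hardware implementation point of view, designing a dedicated decoder for each of them is not economical.

The aim of this thesis is to design a dual mode channel decoder for the third-generation mobile communication standard. The design supports both a turbo decoder with a maximum block length of 20,730 and a Viterbi decoder with different code rates. Their maximum data rates reach 4.52 Mb/s and 5.26 Mb/s, respectively. To reduce the number of external memory accesses, a cache is adopted as the input buffer. Through an efficient interleaver design, both the memory requirement and the complexity of the control unit are effectively reduced. The architecture is written in the Verilog hardware description language, and the synthesized gate count is about 115 thousand. The chip is implemented in a 0.18 µm process with an area of about 11.56 mm², and measurement shows that it operates correctly at a clock rate of 100 MHz. In turbo decoding mode with a fixed number of six iterations, an average power consumption of only 83 mW is needed to reach the maximum data rate of 3.09 Mb/s specified by the standard.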

A Dual Mode Channel Decoder for 3GPP2 Mobile Wireless Communications

Student: Yen-Hsu Shih
Advisor: Dr. Chen-Yi Lee
Institute of Electronics Engineering, National Chiao Tung University

ABSTRACT

In recent years, forward error correction schemes have flourished due to the widespread growth of wireless communication applications. In these applications, turbo codes and convolutional codes are usually adopted at the same time because of their high error correcting ability. However, designing a separate hardware functional block for each decoder is inefficient. In this thesis, a unified turbo and Viterbi decoder architecture for the 3GPP2 standard is presented. Turbo decoding with a maximum block length of 20,730 and Viterbi decoding with various code rates are implemented to provide maximum data rates of 4.52 Mb/s and 5.26 Mb/s, respectively, at a clock rate of 100 MHz. Memory access is reduced by the input caching scheme, and system complexity is lowered by the efficient interleaver design. The chip is fabricated in a 0.18 µm six-metal standard CMOS process. The die size is 3.4 x 3.4 mm² with a core size of 2.7 x 2.7 mm². It contains 115k gates excluding the embedded memory. The measured power dissipation is 83 mW while working at a clock rate of 66 MHz to decode a 3.1 Mb/s turbo encoded data stream with six iterations.

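Acknowledgements

With the flame trees in bloom, the graduation season has come around again in the blink of an eye. Over these two years of master's study, I would first like to express my sincerest gratitude to my advisor, Dr. Chen-Yi Lee. Thanks to his excellent guidance, I was able to find the right research direction in a short time; when I met setbacks, I could learn from the experience and develop the right research attitude. I would also like to thank every member of the Si2 laboratory. Although our research fields differ, everyone here is willing to help one another, which not only taught me the importance of teamwork but also made the laboratory a warm place. In particular, I thank my senior labmate 林建青, who patiently offered many suggestions throughout my research. Finally, I thank my family and friends, who quietly supported me and allowed me to complete my studies smoothly. With everyone's encouragement my life has been richer and more colorful, and I will never forget this memorable period.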

Contents

Chapter 1 Introduction
  1.1. Motivation
  1.2. Design Specification
  1.3. Thesis Organization
Chapter 2 Turbo Coding and Decoding
  2.1. Principle of Turbo Codec
    2.1.1. Turbo Encoding
    2.1.2. Turbo Interleaver
    2.1.3. Turbo Decoding
    2.1.4. Error floor effect
  2.2. MAP Decoding Algorithm for Turbo Decoding
    2.2.1. The MAP algorithm
    2.2.2. The Max-Log-MAP algorithm
    2.2.3. The Log-MAP algorithm
    2.2.4. SNR sensitivity of Max-Log-MAP and Log-MAP algorithm
  2.3. Sliding Windowed Approach
Chapter 3 Principle of Convolutional Codec
  3.1. Convolutional Code
  3.2. Viterbi Decoding
  3.3. Trace-Back Method
  3.4. Summary
Chapter 4 Fixed Point Analysis of Dual Mode Turbo/Viterbi Decoder
  4.1. Fixed Point Analysis for Turbo Decoder
    4.1.1. Input LLR Representation
    4.1.2. Extrinsic Data Representation
    4.1.3. Bit width of Internal Variables
    4.1.4. Performance under Fixed Point Simulation
  4.2. Fixed Point Analysis of Viterbi Decoder
  4.3. Summary
Chapter 5 Architecture of Proposed Dual Mode Turbo/Viterbi Decoder
  5.1. Architecture of Integrated Turbo/Viterbi Decoder
  5.2. Architecture of Turbo Decoder
    5.2.1. Single MAP Decoder design
    5.2.2. Cache design
    5.2.3. Transition Metric Unit (TMU)
    5.2.4. ACS Unit
    5.2.5. LLR unit
    5.2.6. Interleaver design
  5.3. Architecture of Viterbi Decoder
    5.3.1. Transition Metric Unit (TMU)
    5.3.2. Survivor Memory Management
  5.4. Summary
Chapter 6 Chip Implementation
  6.1. Chip Specification
  6.2. Comparison with Other Similar Work
  6.3. Summary
Chapter 7 Conclusion and Future Work
  7.1. Conclusion
  7.2. Future Work
Bibliography

List of Figures

Fig. 2.1 Turbo encoder for 3GPP2 standard
Fig. 2.2 Trellis termination
Fig. 2.3 Turbo interleaver for 3GPP2 standard
Fig. 2.4 Turbo decoding flowchart
Fig. 2.5 Performance comparison under different iteration numbers in 3GPP2 standard
Fig. 2.6 Trellis diagram of turbo code in 3GPP2 standard
Fig. 2.7 The windowed MAP algorithm
Fig. 2.8 Performance comparison among different sub-block lengths in 3GPP2 standard
Fig. 3.1 Rate 1/2 convolutional encoder with the constraint length of 9
Fig. 3.2 A system platform of the convolutional codec
Fig. 3.3 Trellis diagram of the (2, 1, 2) convolutional code and its survivor memory
Fig. 3.4 Trace-back procedure of the convolutional code
Fig. 3.5 The simulation result of Viterbi decoder under different trace-back lengths
Fig. 4.1 Fixed point simulation result of the input symbols with BPSK modulation and rate 1/2 turbo decoding
Fig. 4.2 Fixed point simulation result of the input symbols with 16-QAM modulation and rate 1/5 turbo decoding
Fig. 4.3 Fixed point simulation result of the extrinsic information
Fig. 4.4 An eight-state trellis diagram illustrating message passing within 3 trellis sections
Fig. 4.5 Fixed point analysis with different bit-width of path metrics in turbo decoding
Fig. 4.6 Design loss of fixed point turbo decoder
Fig. 4.7 The performance of Viterbi decoder with different quantization levels
Fig. 4.8 The convergence of any two survivor paths in Viterbi algorithm
Fig. 4.9 The performance of Viterbi decoder under different bit-widths of path metric
Fig. 4.10 The performance analysis on system performance for each kind of code rate
Fig. 5.1 The proposed architecture of integrated turbo/Viterbi decoder
Fig. 5.2 The architecture of integrated turbo/Viterbi decoder in turbo mode
Fig. 5.3 A single MAP decoder architecture for turbo decoding
Fig. 5.4 The input cache architecture
Fig. 5.5 The detail timing chart of the proposed input cache
Fig. 5.6 The TMU architecture for turbo decoder
Fig. 5.7 The ACS architecture for dual mode turbo/Viterbi decoder
Fig. 5.8 The LLR unit architecture for turbo decoder
Fig. 5.9 The architecture of shared memory design in turbo decoder
Fig. 5.10 The address generator for the interleaver of 3GPP2 turbo decoder
Fig. 5.11 The architecture of integrated turbo/Viterbi decoder in Viterbi mode
Fig. 5.12 The architecture of TMU cell
Fig. 5.13 The 3-pointer even algorithm for survivor memory management
Fig. 5.14 The architecture of survivor memory management
Fig. 6.1 The microphoto of the decoder chip

List of Tables

Table 1.1 Difference of FEC specification between 3GPP and 3GPP2 standards
Table 1.2 Specification of channel coding in 3GPP2 standard
Table 2.1 Turbo interleaver parameters
Table 4.1 Summary of bit-width decision for turbo mode
Table 4.2 Summary of bit-width decision for Viterbi mode
Table 4.3 A comparison of bit-width decision with [19] for turbo mode
Table 6.1 Summary of the decoder chip
Table 6.2 Power measurement result of the decoder chip
Table 6.3 Comparison with other similar work

Chapter 1 Introduction

1.1. Motivation

In the last decade, digital technologies were introduced into wireless communications to replace analog systems, owing to increased traffic, speech privacy, new services (data transmission), and robustness of transmission (enhanced coding techniques). These include the Global System for Mobile Communications (GSM), the mobile radio standard with the most subscribers worldwide. It adopts time-division multiple access (TDMA) and provides a maximum data rate of 9.6 kb/s. This data rate cannot satisfy the demand for multimedia document transfer in the 21st century. Besides, the number of customers is increasing much faster than expected, while a TDMA system is dimension-limited: the number of dimensions is determined by the number of time slots, and no additional users are allowed once all time slots are assigned. Hence, third generation mobile radio systems were proposed to meet the market requirement for higher data rates and higher capacity.

Up to now, the Third Generation Partnership Projects 3GPP [1] and 3GPP2 [2] have defined detailed standards providing maximum data rates of about 2 Mb/s and 3 Mb/s, respectively. In addition, code division multiple access (CDMA) was proposed because of its higher capacity. Unlike in a TDMA system, the number of CDMA channels depends on the level of total interference that can be tolerated in the system. Forward error correction coding plays an important role here since it improves the tolerance for interference and thus increases the capacity of a CDMA system. In both the 3GPP and 3GPP2 standards, turbo codes and high complexity convolutional codes ((3, 1, 8) and (6, 1, 8) respectively) are instituted so that transmission quality can be guaranteed.

In the 3GPP2 mobile wireless communication standard, the block length specified for the turbo code is about 4 times larger than in the 3GPP standard. Besides, the code rates of the turbo code and the convolutional code are reduced to 1/5 and 1/6 respectively. This higher complexity makes the integration of a Viterbi decoder and a turbo decoder more difficult. Moreover, the 3GPP2 standard provides a higher data rate of up to 3.09 Mb/s, which implies an intensive demand for memory bandwidth. These factors make the design of a unified decoder more challenging once embedded memory size and power dissipation are taken into account. Hence, we address a channel decoder compatible with the 3GPP2 standard that tackles both area and power problems by proposing a novel architecture for a dual mode turbo/Viterbi decoder. Table 1.1 shows the main differences between the 3GPP and 3GPP2 standards in turbo code and convolutional code.

Table 1.1: Difference of FEC specification between 3GPP and 3GPP2 standards

                                             3GPP        3GPP2
Turbo code          Code rate                1/3         1/5
                    Maximum block length     5114        20730
Convolutional code  Code rate                1/2, 1/3    1/2, 1/3, 1/4, or 1/6
Maximum data rate                            2 Mb/s      3.09 Mb/s

1.2. Design Specification

Our objective is to design a single-chip turbo and Viterbi decoder for the 3GPP2 standard. The detailed specification of the turbo code and the convolutional code is listed in Table 1.2.

Table 1.2 Specification of channel coding in 3GPP2 standard

Turbo code
  Constraint length:               4
  Generator function:              $\left[\, 1 \quad \dfrac{1 + D + D^3}{1 + D^2 + D^3} \quad \dfrac{1 + D + D^2 + D^3}{1 + D^2 + D^3} \,\right]$
  Required data rate:              3.09 Mb/s
Convolutional code
  Constraint length:               9
  Generator function (CR = 1/6):   [457, 755, 511, 637, 625, 727] (octal)
  Generator function (CR = 1/4):   [765, 671, 513, 473] (octal)
  Generator function (CR = 1/3):   [557, 663, 711] (octal)
  Generator function (CR = 1/2):   [753, 561] (octal)
  Required data rate:              1.036 Mb/s

1.3. Thesis Organization

This thesis consists of 7 chapters. Chapter 2 explains the turbo coding and decoding algorithms and their related techniques. The reader is assumed to be familiar with the Viterbi algorithm, so only a brief description is given in Chapter 3. Chapter 4 explains how we decide the fixed-point resolution in our design and shows some simulation results. Chapter 5 presents the proposed architecture; for clarity, the operating flows for turbo mode and Viterbi mode are discussed separately, and several characteristics of our design are stated. Chapter 6 outlines the specification of the implemented chip and compares it with other similar works. Finally, the conclusion and future work are given in Chapter 7.

Chapter 2 Turbo Coding and Decoding

The parallel concatenated convolutional code (PCCC), named turbo code [3], was first proposed by C. Berrou, A. Glavieux, and P. Thitimajshima in 1993. It has been shown to achieve performance close to the Shannon limit using simple constituent codes concatenated through an interleaver. This technique is now adopted in both the 3GPP and 3GPP2 standards due to its excellent error correction ability. In this chapter, we describe the principles of turbo encoding and turbo decoding. The error floor effect in turbo decoding and some decoding techniques are also explained.

2.1. Principle of Turbo codec

2.1.1. Turbo Encoding

The turbo encoder is composed of two recursive systematic convolutional (RSC) encoders, which are connected in parallel but separated by a turbo interleaver. The two RSC encoders are also called the constituent codes of the turbo code. The block diagram of the turbo encoder is illustrated in Fig. 2.1. Note that the same input data are encoded by each RSC encoder but in a different order. In the 3GPP2 standard, each RSC encoder encodes every input bit as one systematic bit and two parity-check bits. Thus, the code rate of each component encoder is 1/3. In order to increase the code rate of the turbo code, the systematic bits of the second RSC encoder are not transmitted. Therefore, the output encoded sequence is {X, Y0, Y1, Y0', Y1'}, and the overall code rate is 1/5.

Fig. 2.1 Turbo encoder for 3GPP2 standard (the input message feeds the first RSC encoder, with outputs X, Y0, Y1, and, through the turbo interleaver, the second RSC encoder, with outputs X', Y0', Y1'; each encoder has a control input).

After encoding all input messages, several tail bits have to be generated to set both component encoders back to the zero state. However, because of its feedback path, an RSC encoder cannot be returned to the zero state simply by feeding dummy zeros into it. A simple solution is shown in Fig. 2.2. While encoding input messages, the switch is set to position "A". Once the messages of the whole block are encoded, the switch is changed to position "B" for three additional cycles. This forces all registers to zero and thus returns the encoder to the zero state.

Fig. 2.2 Trellis Termination (switch positions A and B at the input message; outputs: systematic bit and parity-check bit).
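The encoding and termination steps described above can be illustrated with a minimal Python sketch. It assumes the constituent code uses the generator functions of Table 1.2 (feedback 1 + D^2 + D^3, numerators 1 + D + D^3 and 1 + D + D^2 + D^3). The function names, the register ordering, and the way tail bits are returned are illustrative choices, not the output format of the standard or the hardware of this thesis.

```python
def rsc_step(state, u):
    """One step of the 8-state RSC encoder.

    state is (s1, s2, s3), the register contents with s1 the newest bit.
    Returns (next_state, y0, y1).
    """
    s1, s2, s3 = state
    a = u ^ s2 ^ s3          # feedback polynomial 1 + D^2 + D^3
    y0 = a ^ s1 ^ s3         # numerator 1 + D + D^3
    y1 = a ^ s1 ^ s2 ^ s3    # numerator 1 + D + D^2 + D^3
    return (a, s1, s2), y0, y1


def rsc_encode(bits):
    """Encode a block and terminate the trellis with three tail cycles."""
    state = (0, 0, 0)
    out = []
    for u in bits:                     # switch in position "A"
        state, y0, y1 = rsc_step(state, u)
        out.append((u, y0, y1))
    for _ in range(3):                 # switch in position "B"
        _, s2, s3 = state
        u_tail = s2 ^ s3               # cancels the feedback, so a zero is shifted in
        state, y0, y1 = rsc_step(state, u_tail)
        out.append((u_tail, y0, y1))
    return out, state


def turbo_encode(bits, perm):
    """Rate-1/5 symbols (X, Y0, Y1, Y0', Y1') per information bit; tail bits omitted."""
    enc1, _ = rsc_encode(bits)
    enc2, _ = rsc_encode([bits[p] for p in perm])
    return [(bits[k], enc1[k][1], enc1[k][2], enc2[k][1], enc2[k][2])
            for k in range(len(bits))]
```

Feeding zeros directly into `rsc_step` from a non-zero state cycles through the seven non-zero states without ever reaching the zero state, which is why the tail bits are derived from the feedback path instead.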

2.1.2. Turbo Interleaver

The interleaver plays a very important role in the turbo encoder. First, a proper coding gain can be achieved with small-memory RSC encoders since the interleaver scrambles a long block message. Second, the interleaver de-correlates the inputs of the two RSC encoders so that an iterative decoding algorithm can be applied between the two component decoders. Theoretically, the block size of the interleaver is one of the major factors that lower the upper bound on the bit error probability of the turbo code system. The performance upper bound of a turbo code with a uniform random interleaver has been evaluated in [4]. The result shows that the upper bound on the bit error probability is approximately proportional to 1/N, where N is the block size of the turbo interleaver. The factor "1/N" is also called the interleaver gain.

Fig. 2.3 shows the address generator of the turbo interleaver in the 3GPP2 standard. It provides a maximum block size of 20,730 and a minimum block size of 378. The supported block sizes and their corresponding "n" values are listed in Table 2.1.

Fig. 2.3 Turbo Interleaver for 3GPP2 standard (an (n+5)-bit counter; its n MSBs go through "add 1 and select the n LSBs"; its 5 LSBs (i4…i0) drive an n-bit table lookup and a bit reversal (i0…i4); the two n-bit values are multiplied and the n LSBs are selected as the LSBs of the interleaver output address, the bit-reversed 5 bits form its MSBs, and the address is discarded if it is ≧ Nturbo).
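The data flow of Fig. 2.3 can be sketched in Python as follows. This is only an illustration of the address generation steps named in the figure. The lookup table of the 3GPP2 standard is not reproduced here, so `placeholder_table` is a hypothetical stand-in with odd entries, chosen so that the mapping is a permutation; the function names are likewise illustrative.

```python
def bit_reverse5(x):
    """Reverse the order of the 5 bits of x (i4..i0 becomes i0..i4)."""
    return int(format(x, "05b")[::-1], 2)


def placeholder_table(n):
    """Hypothetical 32-entry table of odd n-bit values (NOT the 3GPP2 table)."""
    return [(2 * i + 1) % (1 << n) for i in range(32)]


def interleaver_addresses(n_turbo, n, table):
    """Generate interleaver output addresses following the Fig. 2.3 data flow."""
    mask = (1 << n) - 1
    addresses = []
    for counter in range(1 << (n + 5)):        # (n+5)-bit counter
        msbs = counter >> 5                    # n MSBs of the counter
        lsbs = counter & 0x1F                  # 5 LSBs of the counter (i4..i0)
        a = (msbs + 1) & mask                  # add 1 and select the n LSBs
        b = (a * table[lsbs]) & mask           # multiply and select the n LSBs
        addr = (bit_reverse5(lsbs) << n) | b   # bit-reversed LSBs become the MSBs
        if addr < n_turbo:                     # discard if address >= N_turbo
            addresses.append(addr)
    return addresses
```

For example, `interleaver_addresses(378, 4, placeholder_table(4))` visits every address from 0 to 377 exactly once, because each of the 2^(n+5) tentative addresses is produced once and the out-of-range ones are discarded.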

Table 2.1 Turbo interleaver parameters

Turbo interleaver block size    Turbo interleaver parameter (n)
378                             4
402                             4
570                             5
762                             5
786                             5
1,146                           6
1,530                           6
1,554                           6
2,298                           7
2,322                           7
3,066                           7
3,090                           7
3,858                           7
4,602                           8
6,138                           8
9,210                           9
12,282                          9
20,730                          10

2.1.3. Turbo Decoding

The general idea of iterative turbo decoding is illustrated in Fig. 2.4, where rs is the received systematic information, rp1 is the received parity information generated by the first RSC encoder, and rp2 is the received parity information generated by the second RSC encoder. The iterative turbo decoder consists of two constituent decoders, which are soft-in/soft-out (SISO) decoders concatenated serially via one interleaver and one de-interleaver. An additional interleaver interleaves the input systematic information and provides the interleaved data to the second SISO decoder. The two component decoders can be implemented based on either the soft-output Viterbi algorithm (SOVA) [5] or the maximum a posteriori probability (MAP) algorithm [6], which is discussed in detail in the next section. During the iterative decoding process, each constituent decoder delivers the extrinsic information $L_{ex}(u)$, which is taken as a priori information by the other constituent decoder. That is, $L_{in1}(u_k) = L_{ex2}(u_k)$ and $L_{in2}(u_k) = L_{ex1}(u_k)$. As the number of iterations increases, better coding gain is expected. However, the correlation between the two SISO decoders also rises. Therefore, there is no significant performance improvement once the number of iterations reaches a threshold. Fig. 2.5 shows the performance comparison under different iteration numbers in the 3GPP2 standard.

Fig. 2.4 Turbo decoding flowchart (SISO Decoder 1 takes rs, rp1, and the de-interleaved Lex2(u); SISO Decoder 2 takes the interleaved rs, rp2, and the interleaved Lex1(u); the decoders output L1(u) and L2(u)).
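The exchange of extrinsic information in Fig. 2.4 can be sketched as the following loop. This is a data-flow illustration only, under several assumptions: the constituent decoder is passed in as a hypothetical callable `siso(sys, par, apriori)` that returns one extrinsic value per bit, the inputs are already scaled soft values, `perm[i]` gives the original index of the bit at interleaved position `i`, and the final soft output is formed as the sum of the systematic value and both extrinsic values.

```python
def turbo_decode(siso, rs, rp1, rp2, perm, n_iter=6):
    """Iterative decoding loop following the structure of Fig. 2.4."""
    n = len(rs)
    rs_int = [rs[p] for p in perm]            # interleaved systematic input
    l_ex1 = [0.0] * n
    l_ex2 = [0.0] * n                         # no a priori knowledge at the start
    for _ in range(n_iter):
        l_ex1 = siso(rs, rp1, l_ex2)          # L_in1 = L_ex2
        l_in2 = [l_ex1[p] for p in perm]      # interleave L_ex1
        l_ex2_int = siso(rs_int, rp2, l_in2)  # L_in2 = L_ex1
        l_ex2 = [0.0] * n
        for i, p in enumerate(perm):          # de-interleave L_ex2
            l_ex2[p] = l_ex2_int[i]
    return [rs[k] + l_ex1[k] + l_ex2[k] for k in range(n)]
```

The default of six iterations mirrors the fixed iteration count quoted in the abstract; any other value can be passed through `n_iter`.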

Fig. 2.5 Performance comparison under different iteration numbers in 3GPP2 standard (BER versus SNR for 1, 4, 6, 8, and 10 iterations; N = 20730, 16-QAM, code rate 1/5).

2.1.4. Error floor effect

Although turbo coding provides excellent performance, the bit error rate decreases quite slowly at high signal-to-noise ratio (SNR). This phenomenon can be observed in Fig. 2.5. It is due to the relatively small free distance of turbo codes, and is called an "error floor" [7]. The relation between the minimum free distance and the bit error probability in turbo coding can be expressed by

$$P_b \propto Q\left(\sqrt{2\, d_{free}\, R\, \frac{E_b}{N_0}}\right) \tag{2.1}$$

where $d_{free}$ is the minimum free distance, $R$ is the code rate, and $E_b/N_0$ is the SNR. At low SNR, most errors can be corrected by iterative decoding since the systematic information and the parity information can be regarded as highly independent events. However, as the channel becomes more reliable, the dependency between the systematic and parity information grows and the interleaver contributes little to iterative decoding. Thus, the error correction ability is limited to that of the weak constituent code. To overcome this issue, we can increase the interleaver size to lower the position of the error floor, or concatenate a block code, e.g. a BCH code, as an outer code to remove the residual error bits. For more details, please refer to [4] [8].

2.2. MAP Decoding algorithm for Turbo Decoding

It has been shown that the MAP algorithm performs better than SOVA as a decoding method for turbo codes [9]. Unlike the Viterbi algorithm, which is a maximum likelihood (ML) algorithm that finds the codeword with minimum error probability, the MAP algorithm minimizes the symbol (or bit) error probability. In this section, we focus on the turbo decoding methods based on the MAP algorithm [6] [10]. Although SOVA is also commonly used for turbo decoding, we skip it since it is not adopted in our proposed design. For more detail about SOVA, please refer to [5]. Some comparisons of the MAP algorithm and SOVA applied to turbo code systems are shown in [9].

2.2.1. The MAP algorithm

The main idea of the MAP algorithm is to compute the log-likelihood ratio (LLR) of the transmitted information bit $u_k$ conditioned on the received information for $1 \le k \le N$, where N is the block length of the encoded message:

$$L(\hat{u}_k) = L(u_k \mid r) = \log \frac{P(u_k = +1 \mid r)}{P(u_k = -1 \mid r)} \tag{2.2}$$

Here r is the sequence of received soft values; the values received at time index k are $r_k = [r_{k,1}, r_{k,2}, \ldots, r_{k,n}]$, where n is the number of output bits for each encoded bit in the constituent code. Let us consider the trellis diagram of the turbo code in the 3GPP2 standard, which is shown in Fig. 2.6, as an example.
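As a small numerical illustration of the LLR defined in (2.2), the following Python sketch converts between the a posteriori probability and the LLR and takes a hard decision from its sign. The function names are illustrative, and deciding +1 when the LLR is exactly zero is an arbitrary tie-breaking assumption.

```python
import math


def llr_from_prob(p_plus):
    """LLR of (2.2) given P(u_k = +1 | r); P(u_k = -1 | r) is 1 - p_plus."""
    return math.log(p_plus / (1.0 - p_plus))


def prob_from_llr(llr):
    """Inverse mapping: P(u_k = +1 | r) for a given LLR."""
    return 1.0 / (1.0 + math.exp(-llr))


def hard_decision(llr):
    """Decide u_k = +1 for a non-negative LLR, otherwise -1."""
    return 1 if llr >= 0 else -1
```

The sign of the LLR carries the decision and its magnitude carries the reliability: a probability of 0.5 maps to an LLR of zero, and probabilities closer to 0 or 1 map to LLRs of larger magnitude.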

Note that the solid lines represent the transitions corresponding to an information bit $u_k$ of -1, while the dotted lines represent the transitions corresponding to an information bit $u_k$ of +1. Then, (2.2) can be further expressed as

$$L(\hat{u}_k) = \log \frac{P(u_k = +1 \mid r)}{P(u_k = -1 \mid r)} = \log \frac{\sum_{u_k = +1} P(s_{k-1}, s_k, r)}{\sum_{u_k = -1} P(s_{k-1}, s_k, r)} \tag{2.3}$$

where the numerator and denominator are the sums of the joint probabilities over all existing transitions from state $s_{k-1}$ to state $s_k$ that correspond to an information bit $u_k$ of +1 and -1 respectively.

Fig. 2.6 Trellis diagram of turbo code in 3GPP2 standard (states $S_k$ and $S_{k+1}$; branches for $u_k = -1$ and $u_k = +1$; forward direction for computing α, backward direction for computing β).

Assume the encoded data are transmitted through a discrete memoryless channel (DMC); then the term $P(s_{k-1}, s_k, r)$ can be decomposed into three terms:

$$P(s_{k-1}, s_k, r) = P(s_{k-1}, r_{j<k}) \cdot P(s_k, r_k \mid s_{k-1}) \cdot P(r_{j>k} \mid s_k) = e^{\alpha_{k-1}(s_{k-1})} \cdot e^{\gamma_k(s_{k-1}, s_k)} \cdot e^{\beta_k(s_k)}. \tag{2.4}$$

Here $e^{\alpha_{k-1}(s_{k-1})}$ is the joint probability of state $s_{k-1}$ and the received symbols $r_j$ from the beginning of the block up to time index k-1. Similarly, $e^{\beta_k(s_k)}$ is the probability of the received symbols $r_j$ from the end of the block back to time index k, conditioned on state $s_k$. By shifting the value of k, it can be seen that α is the forward recursion of the MAP algorithm, which can be formulated as

$$e^{\alpha_k(s_k)} = \sum_{s_{k-1}} e^{\gamma_k(s_{k-1}, s_k)} \cdot e^{\alpha_{k-1}(s_{k-1})}. \tag{2.5}$$

In the same way, the backward recursion β can be formulated as

$$e^{\beta_{k-1}(s_{k-1})} = \sum_{s_k} e^{\gamma_k(s_{k-1}, s_k)} \cdot e^{\beta_k(s_k)}. \tag{2.6}$$

Note that since the trellis of the turbo code diverges from state zero and converges to state zero, the initial conditions of the forward and backward recursions should be set as

$$e^{\alpha_0(s_0)} = \begin{cases} 1, & \text{for } s_0 = 0 \\ 0, & \text{otherwise} \end{cases} \tag{2.7}$$

and

$$e^{\beta_N(s_N)} = \begin{cases} 1, & \text{for } s_N = 0 \\ 0, & \text{otherwise.} \end{cases} \tag{2.8}$$

For any existing transition from $s_{k-1}$ to $s_k$, the branch transition probability $e^{\gamma_k(s_{k-1}, s_k)}$ can be further decomposed as

$$e^{\gamma_k(s_{k-1}, s_k)} = P(s_k, r_k \mid s_{k-1}) = P(s_k \mid s_{k-1}) \cdot P(r_k \mid s_{k-1}, s_k) = P(u_k) \cdot P(r_k \mid u_k). \tag{2.9}$$

Here, the term $P(u_k)$ is known as the a priori probability. According to the definition of the LLR, which is

$$L(u_k) = \log \frac{P(u_k = +1)}{P(u_k = -1)}, \tag{2.10}$$

$P(u_k)$ can be rewritten as

$$P(u_k = \pm 1) = \frac{e^{\pm L(u_k)}}{1 + e^{\pm L(u_k)}} = \frac{e^{-L(u_k)/2}}{1 + e^{-L(u_k)}} \cdot e^{L(u_k) \cdot u_k / 2} = A_k \cdot e^{L(u_k) \cdot u_k / 2} \tag{2.11}$$

where the term $A_k$ is equal for all transitions at the same time index, and thus cancels out in (2.3). On the other hand, the value of $P(r_k \mid u_k)$ depends on the channel characteristics. For an additive white Gaussian noise (AWGN) channel, the LLR of $r_k$ conditioned on $u_k$ can be expressed as

$$L(r_k \mid u_k) = \log \frac{P(r_k \mid u_k = +1)}{P(r_k \mid u_k = -1)} = \log \frac{\left. \prod_{v=1}^{n} \exp\left(-\frac{E_s}{N_0}(r_{k,v} - x_{k,v})^2\right) \right|_{u_k = +1}}{\left. \prod_{v=1}^{n} \exp\left(-\frac{E_s}{N_0}(r_{k,v} - x_{k,v})^2\right) \right|_{u_k = -1}} = \sum_{v=1}^{n} L_c \cdot r_{k,v} \cdot x_{k,v} \tag{2.12}$$

where $L_c = 4E_s/N_0$ is called the channel reliability. Here, $x_{k,v}$ is the v-th transmitted symbol when encoding $u_k$. For systematic codes, $x_{k,1}$ is equal to $u_k$. Now we can obtain the value of $P(r_k \mid u_k)$ by using the technique of (2.11) with $L(u_k)$ replaced by $L(r_k \mid u_k)$:

$$P(r_k \mid u_k) = B_k \cdot \exp\left(\frac{1}{2} L_c r_{k,1} u_k + \frac{1}{2} \sum_{v=2}^{n} L_c r_{k,v} x_{k,v}\right). \tag{2.13}$$

For the same reason as in (2.11), $B_k$ also cancels out in (2.3). Combining (2.11) and (2.13), the $\gamma_k$ in (2.9) can be reduced to

$$e^{\gamma_k(s_{k-1}, s_k)} = A_k \cdot B_k \cdot \exp\left(\frac{1}{2}\left((L_c r_{k,1} + L(u_k)) \cdot u_k + \sum_{v=2}^{n} L_c r_{k,v} x_{k,v}\right)\right). \tag{2.14}$$

Substituting (2.5), (2.6), and (2.14) into (2.4), we can derive the a posteriori LLR in the form of

$$L(\hat{u}_k) = \log \frac{\sum_{(s_{k-1}, s_k),\, u_k = +1} e^{\alpha_{k-1}(s_{k-1})} \cdot e^{\gamma_k(s_{k-1}, s_k)} \cdot e^{\beta_k(s_k)}}{\sum_{(s_{k-1}, s_k),\, u_k = -1} e^{\alpha_{k-1}(s_{k-1})} \cdot e^{\gamma_k(s_{k-1}, s_k)} \cdot e^{\beta_k(s_k)}} = L_c r_{k,1} + L(u_k) + L_{ex}(u_k) \tag{2.15}$$

where

$$L_{ex}(u_k) = \log \frac{\sum_{(s_{k-1}, s_k),\, u_k = +1} e^{\alpha_{k-1}(s_{k-1})} \cdot e^{\frac{1}{2} \sum_{v=2}^{n} L_c r_{k,v} x_{k,v}} \cdot e^{\beta_k(s_k)}}{\sum_{(s_{k-1}, s_k),\, u_k = -1} e^{\alpha_{k-1}(s_{k-1})} \cdot e^{\frac{1}{2} \sum_{v=2}^{n} L_c r_{k,v} x_{k,v}} \cdot e^{\beta_k(s_k)}}. \tag{2.16}$$

The term $L_{ex}(u_k)$ is called the extrinsic information since it is a function of the redundant information that comes from the encoder. It excludes the systematic input and the a priori information contained in $L(\hat{u}_k)$. Therefore, this term is useful for estimating the a priori probability for the next component decoder, and a great performance improvement in iterative MAP decoding can be achieved.

2.2.2. The Max-Log-MAP algorithm

As we can see, the MAP algorithm involves many exponentiations and multiplications, which are quite complex for hardware realization. Thus, an approximation of the MAP algorithm termed the Max-Log-MAP algorithm [11] was derived for simple implementation of MAP decoders. Instead of calculating $e^{\gamma_k}$, $e^{\alpha_k}$, and $e^{\beta_k}$ directly, all computations are done in the logarithm domain. Here we define $\gamma_k$, $\alpha_k$, and $\beta_k$ as the transition metric, forward path metric, and backward path metric respectively. $\gamma_k$ can be formulated as

$$\gamma_k(s_{k-1}, s_k) = \log P(s_k, r_k \mid s_{k-1}). \tag{2.17}$$

Similarly, referring to (2.4), the forward and backward path metrics can be expressed as

$$\alpha_{k-1}(s_{k-1}) = \log P(s_{k-1}, r_{j<k}) \tag{2.18}$$

and

(27)

\[
\beta_{k-1}(s_{k-1}) = \log P(r_{j>k} \mid s_k) \tag{2.19}
\]

respectively. After substituting (2.17), (2.18), and (2.19), $L(\hat{u}_k)$ in (2.15) can be rewritten as

\[
L(\hat{u}_k) = \log \frac{\sum_{(s_{k-1}, s_k),\, u_k = +1} \exp\left(\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\right)}{\sum_{(s_{k-1}, s_k),\, u_k = -1} \exp\left(\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\right)}. \tag{2.20}
\]

By utilizing the approximation

\[
\log\left(e^{\delta_1} + e^{\delta_2} + \cdots + e^{\delta_n}\right) \approx \max(\delta_1, \delta_2, \ldots, \delta_n), \tag{2.21}
\]

$L(\hat{u}_k)$ can be further simplified to

\[
\begin{aligned}
L(\hat{u}_k) \approx{} & \max_{(s_{k-1}, s_k),\, u_k = +1} \left(\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\right) \\
& - \max_{(s_{k-1}, s_k),\, u_k = -1} \left(\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\right).
\end{aligned} \tag{2.22}
\]

This computation consists of forward and backward recursions that repeatedly compute $\alpha_k$ and $\beta_k$, which can be expressed as

\[
\alpha_k(s_k) = \max_{s_{k-1},\, u_k} \left(\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k)\right) \tag{2.23}
\]

and

\[
\beta_{k-1}(s_{k-1}) = \max_{s_k,\, u_k} \left(\beta_k(s_k) + \gamma_k(s_{k-1}, s_k)\right). \tag{2.24}
\]

Both equations are add-compare-select (ACS) operations, similar to the path metric update of the Viterbi algorithm.

2.2.3. The Log-MAP algorithm

The Max-Log-MAP algorithm is a sub-optimal solution for turbo decoding, since the approximation (2.21) is used to reduce the complexity of the MAP algorithm. This loss can be recovered by the Log-MAP algorithm [11]. It employs the Jacobian logarithm

(28)

\[
\log\left(e^{\delta_1} + e^{\delta_2}\right) = \max(\delta_1, \delta_2) + \log\left(1 + e^{-|\delta_1 - \delta_2|}\right) = \max(\delta_1, \delta_2) + f_c(|\delta_1 - \delta_2|), \tag{2.25}
\]

where $f_c(|\delta_1 - \delta_2|)$ is a correction function that restores the accuracy lost by taking the maximum alone. It has been proved that the left-hand side of (2.21) can be computed exactly by applying (2.25) recursively [9]:

\[
\begin{aligned}
\log\left(e^{\delta_1} + e^{\delta_2} + \cdots + e^{\delta_n}\right) &= \log\left(\Delta + e^{\delta_n}\right), \qquad \Delta = e^{\delta_1} + e^{\delta_2} + \cdots + e^{\delta_{n-1}} = e^{\delta} \\
&= \max(\log \Delta, \delta_n) + f_c(|\log \Delta - \delta_n|) \\
&= \max(\delta, \delta_n) + f_c(|\delta - \delta_n|).
\end{aligned} \tag{2.26}
\]

Substituting (2.18) and (2.19) into (2.25), the forward and backward recursions can be represented as

\[
\alpha_k(s_k) = \max^{*}_{s_{k-1},\, u_k} \left(\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k)\right) \tag{2.27}
\]

and

\[
\beta_{k-1}(s_{k-1}) = \max^{*}_{s_k,\, u_k} \left(\beta_k(s_k) + \gamma_k(s_{k-1}, s_k)\right), \tag{2.28}
\]

where the max*(.) operation is defined as

\[
\max{}^{*}(\delta_1, \delta_2) = \max(\delta_1, \delta_2) + \log\left(1 + e^{-|\delta_1 - \delta_2|}\right). \tag{2.29}
\]

Finally, $L(\hat{u}_k)$ can be obtained by

\[
\begin{aligned}
L(\hat{u}_k) ={} & \max^{*}_{(s_{k-1}, s_k),\, u_k = +1} \left(\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\right) \\
& - \max^{*}_{(s_{k-1}, s_k),\, u_k = -1} \left(\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1}, s_k) + \beta_k(s_k)\right).
\end{aligned} \tag{2.30}
\]

The performance of the Log-MAP algorithm is identical to that of the MAP algorithm. However, its complexity is higher than that of the Max-Log-MAP algorithm, since computing $f_c(.)$ still involves an exponentiation and a logarithm. The values of $f_c(.)$ are therefore usually stored in a pre-computed table, so that the Log-MAP algorithm can be implemented by table look-up. It has been found that excellent performance can be obtained with 8 stored values and $|\delta_1 - \delta_2|$ ranging between 0 and 5, and no improvement is achieved with a finer

(29) representation [9].

2.2.4. SNR sensitivity of Max-Log-MAP and Log-MAP algorithm

From (2.13) and the derivation that follows it, it is evident that both the MAP and the Log-MAP algorithms require SNR estimation to obtain the value of the channel reliability $L_c$. Unfortunately, accurate estimation is not easy to achieve. Several papers have discussed the effect of SNR mismatch in turbo decoding. The simulations in [12] show that an SNR estimation offset of about -3 to +6 dB is tolerable before significant performance degradation sets in. The Max-Log-MAP algorithm, however, provides an SNR-independent scheme if the a priori information is initialized with a reasonable value, such as all zeros for each state [13]. Due to the linearity of the max(.) operation, the term $L_c$ can be canceled out while computing $L(\hat{u}_k)$. A comparison of the Max-Log-MAP and Log-MAP algorithms under different SNR estimation offsets was made in [13].

Although the Log-MAP algorithm performs better than the Max-Log-MAP algorithm, it runs the risk of a serious SNR mismatch. Channel characteristics therefore play an important role in a practical implementation. It was concluded in [13] that if the channel characteristics change over time, the Max-Log-MAP decoder is the suitable constituent decoder for turbo decoding. Otherwise, the Log-MAP decoder is preferable in terms of coding gain.

2.3. Sliding Windowed Approach

As described in the previous section, the MAP-series algorithms (the MAP, Max-Log-MAP, and Log-MAP algorithms) require the entire block to be received before the decoding procedure can start, since the backward path metric

(30) computation needs information at the end of the trellis. This restriction enlarges the memory requirement of a hardware turbo decoder. For example, the maximum block length of the 3GPP2 standard is 20730, which means that 20730 metrics should be stored. A long output latency is also introduced, which is a disadvantage for turbo codes in real-time applications.

A simple way to solve these problems is to divide the data stream into many sub-blocks. However, the last bits of each sub-block suffer from lower error tolerance because the backward recursion lacks initial metrics. A sliding windowed approach was therefore proposed in [14], and later in [15], to overcome this drawback. It exploits the fact that the backward path metrics become highly reliable even without knowledge of the initial state, provided that the backward recursion runs long enough. The windowed processing schedule used in our design is illustrated in Fig. 2.7, and the detailed operating flow is described as follows.

Fig. 2.7 The windowed MAP algorithm. [Diagram: sub-blocks ..., i, i+1, i+2, i+3, ... against time slots t1 to t4, showing which sub-block the α, β1, and β2 recursions process in each slot.]

Initially, the received data block is divided into many sub-blocks, each with a length of L. L is called the convergence length. Typically, it is about five times the constraint length of the encoder. In the 3GPP2 standard, the constraint length is 4. For each sub-block i, the

(31) forward recursion computes the forward path metrics α and stores these values in memory. In parallel, an additional backward recursion β1 is performed on the next sub-block i+1. Once the β1 operation on sub-block i+1 is finished, the last backward path metric obtained for each state is regarded as a reliable initial β for sub-block i, which then starts its own backward recursion, labeled β2 in Fig. 2.7. Finally, $L(\hat{u}_k)$ is computed from α, β2, and γ. Fig. 2.8 shows the influence of different sub-block lengths on the performance of the turbo code.

Fig. 2.8 Performance comparison among different sub-block lengths in 3GPP2 standard. [Plot: "Performance of Turbo Decoder under different sub-block length (N=20730, 16-QAM, Code Rate=1/5, 6 iterations)"; BER versus SNR (-1 to 4); curves for sublen = N, 24, 20, 16, and 12.]
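The windowed schedule described above can be sketched in a few lines of Python. This is only an illustration of the ordering of the three recursions, not the control logic of the chip: the function name `window_schedule`, the one-time-slot-per-sub-block timing, and the assumption that the last sub-block needs no β1 warm-up (its initial β being supplied by the trellis termination) are all introduced here for the example.

```python
def window_schedule(num_subblocks):
    """Return, per time slot, the sub-block index each recursion works on.

    None means the unit is idle in that slot. Assumed timing: in slot t the
    alpha unit processes sub-block t, beta1 warms up on sub-block t+1, and
    beta2 processes sub-block t-1 using the metrics beta1 produced one slot
    earlier.
    """
    slots = []
    for t in range(num_subblocks + 1):
        alpha = t if t < num_subblocks else None
        beta1 = t + 1 if t + 1 < num_subblocks else None
        beta2 = t - 1 if t >= 1 else None
        slots.append({"alpha": alpha, "beta1": beta1, "beta2": beta2})
    return slots


if __name__ == "__main__":
    for t, slot in enumerate(window_schedule(4), start=1):
        print(f"t{t}: {slot}")
```

Under these assumptions, every sub-block receives its β2 pass exactly one slot after its α pass, so only the forward metrics of one sub-block have to be held until they are consumed.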

(32) Chapter 3 Principle of Convolutional codec

3.1. Convolutional Code

The encoder of a convolutional code is constructed from several memory elements and modulo-2 adders. It is usually described as an (n, k, v) convolutional encoder, where n, k, and v are the number of outputs, the number of inputs, and the number of memory elements respectively. The 3GPP2 standard specifies rate 1/2, 1/3, 1/4, and 1/6 convolutional codes, all of which have a constraint length of 9. An example of a rate 1/2 convolutional encoder with the generator matrix [753, 561] (octal) is illustrated in Fig. 3.1. For each input information bit, it generates two code symbols (c0 and c1) according to the generator matrix. The generator matrices of the convolutional codes for the other code rates are listed in Table 1.2. The convolutional encoder should be initialized to the all-zero state.

Fig. 3.1 Rate 1/2 convolutional encoder with the constraint length of 9. [Diagram: the input information bits feed a shift register; the generators g0 and g1 produce the code symbols c0 and c1.]
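As a concrete illustration of the encoder just described, the following Python sketch encodes a bit sequence with the rate 1/2, constraint length 9 code. The function name `conv_encode` is hypothetical, and the tap ordering is an assumption made for this example: the most significant bit of each octal generator is taken to multiply the current input bit. The bit ordering of the 3GPP2 specification should be checked before relying on it.

```python
def conv_encode(bits, gens=(0o753, 0o561), constraint_len=9):
    """Encode a list of 0/1 bits; returns one tuple of code symbols per bit.

    Assumption: the MSB of each generator taps the current input bit and the
    LSB taps the oldest bit held in the shift register.
    """
    state = 0  # constraint_len - 1 memory elements, initialised to all zeros
    out = []
    for b in bits:
        # Current bit goes to the MSB position, previous bits follow below it.
        reg = (b << (constraint_len - 1)) | state
        # Each code symbol is the modulo-2 sum of the tapped register bits.
        out.append(tuple(bin(reg & g).count("1") & 1 for g in gens))
        state = reg >> 1  # shift: drop the oldest bit
    return out


if __name__ == "__main__":
    impulse = [1] + [0] * 8
    print(conv_encode(impulse))
```

Feeding a single 1 followed by zeros returns the generator taps themselves, which is a quick way to confirm that the chosen tap ordering is the intended one.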

(33) 3.2. Viterbi Decoding

The Viterbi algorithm [16] is the optimal method for decoding convolutional codes. It is a maximum likelihood decoding algorithm that searches for the shortest path through a weighted graph. The Viterbi algorithm has become a standard because of its moderate decoding complexity. Before explaining the algorithm, we first describe the system platform shown in Fig. 3.2.

Fig. 3.2 A system platform of the convolutional codec. [Block diagram: m → Convolutional Encoder → c → Modulator → x → Channel → r → Viterbi Decoder → $\hat{m}$.]

Initially, the message sequence m is encoded into the codeword sequence c. After modulation, the modulated sequence x is transmitted over the channel, and the sequence r is received at the receiver. The central idea of the Viterbi algorithm is to find the maximum likelihood sequence $\hat{m}$ according to r. Theoretically, this is equivalent to maximizing the probability P(m | r). Using Bayes' rule,

\[
P(m \mid r) = \frac{P(m) \cdot P(r \mid m)}{P(r)}, \tag{3.1}
\]

where P(r) is independent of m. Assuming that all message sequences are equally likely, the decoder therefore has to maximize the probability P(r | m). If the length of the received sequence is τ, then P(r | m) can be expressed as

\[
\begin{aligned}
P(r \mid m) &= P(r \mid x) \\
&= \prod_{t=1}^{\tau} P(r_t \mid x_t) \\
&= \prod_{t=1}^{\tau} \prod_{i=0}^{n-1} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(r_{t,i} - x_{t,i})^2}{2\sigma^2}\right).
\end{aligned} \tag{3.2}
\]

As before, these computations can be moved to the logarithm domain to reduce the computing complexity. The probability P(r | m) in the logarithm domain is given by

(34)

\[
\log P(r \mid m) = \sum_{t=1}^{\tau} \log P(r_t \mid x_t). \tag{3.3}
\]

For the AWGN channel, (3.3) can be further rewritten as

\[
\begin{aligned}
\log P(r \mid m) &= \sum_{t=1}^{\tau} \log \prod_{i=0}^{n-1} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(r_{t,i} - x_{t,i})^2}{2\sigma^2}\right) \\
&= -\frac{n\tau}{2} \log(2\pi) - n\tau \log(\sigma) - \sum_{t=1}^{\tau} \sum_{i=0}^{n-1} \frac{(r_{t,i} - x_{t,i})^2}{2\sigma^2}.
\end{aligned} \tag{3.4}
\]

In other words, maximizing the probability P(r | m) is the same as minimizing the Euclidean distance

\[
\sum_{t=1}^{\tau} \sum_{i=0}^{n-1} \frac{(r_{t,i} - x_{t,i})^2}{2\sigma^2}. \tag{3.5}
\]

In order to compute the Euclidean distance, the Viterbi algorithm defines the branch metric (BM, also called the transition metric or simply TM) $\lambda_t(s_{t-1}, s_t)$ and the path metric $\Gamma_{t, s_t}$ as follows:

\[
\lambda_t(s_{t-1}, s_t) = \sum_{i=0}^{n-1} \frac{(r_{t,i} - x_{t,i})^2}{2\sigma^2}, \qquad s_{t-1} = 0, 1, \ldots, 2^{v} - 1, \tag{3.6}
\]

\[
\Gamma_{t, s_t} = \min\left(\Gamma_{t-1, s_{t-1}} + \lambda_t(s_{t-1}, s_t)\right). \tag{3.7}
\]

The path metric $\Gamma_{t, s_t}$ is thus the minimum Euclidean distance accumulated up to state $s_t$. At each time index, the decoder computes and compares the metrics of all branches entering a state. The branch with the minimum metric and its corresponding decision bit are preserved, and the others are eliminated. The recorded history of decision bits is called the survivor. According to the minimum path metric at each time index, the maximum likelihood sequence can be estimated.

Finally, the steps of the Viterbi algorithm can be summarized as follows.

1. Initialize all path metric storage and the survivor memory.

2. According to the received sequence r, compute the branch metric $\lambda_t(s_{t-1}, s_t)$ for each state transition.

3. Accumulate the path metric with the branch metric that will converge toward the same

(35) state.

4. Update the path metric storage for each state according to the following principle:

\[
\Gamma_{t, s_t} = \min\left(\Gamma_{t-1, s_{t-1}} + \lambda_t(s_{t-1}, s_t)\right).
\]

The decision bit of each state is stored into the survivor memory at the same time.

5. Decode the message sequence according to the minimum path metric and the survivor.

6. Repeat this process until all messages are decoded.

3.3. Trace-back Method

The trace-back method is a technique for tracing the maximum likelihood sequence through the survivor memory. Here we use a (2, 1, 2) convolutional code with the generator matrix [111, 101] (binary) as the example. Its trellis diagram and the corresponding contents of the survivor memory are shown in Fig. 3.3. In this figure, all dotted lines represent eliminated paths. If the upper path entering a state is chosen, the decision bit is set to zero; otherwise, it is set to one.

Fig. 3.3 Trellis diagram of the (2, 1, 2) convolutional code and its survivor memory. [Diagram: four-state trellis (S00, S01, S10, S11) annotated with received symbols, branch metrics λ, and path metrics Γ, next to the decision bits stored in the survivor memory.]

(36) After all symbols are received, the maximum likelihood sequence can be decided by the trace-back method. The procedure starts from the state with the minimum path metric, which is S00 in this example. By recursively shifting the state number left and inserting the decision bit stored in the survivor memory at the right-hand side, the decoding procedure can be completed. The overall trace-back operation for the example in Fig. 3.3 is illustrated in Fig. 3.4.

Fig. 3.4 Trace-back procedure of the convolutional code. [Diagram: the trellis and survivor memory of Fig. 3.3 with the traced-back path marked.]

In practice, the received symbol sequence may be quite long. If the trace-back operation does not start until all symbols are received, an extremely large survivor memory is required, which is impractical for chip realization. A suitable trace-back length that causes no serious performance degradation should therefore be defined. Similar to the convergence length introduced in section 2.3, it is about five times the constraint length of the encoder. A simulation result under different trace-back lengths is shown in Fig. 3.5.
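The add-compare-select and trace-back steps can be illustrated with a minimal Python sketch for the (2, 1, 2) code with generators [111, 101]. It is not the decoder architecture of this thesis. Three simplifications are assumed: the branch metric is the Hamming distance on hard decisions instead of the Euclidean metric of (3.6), trace-back starts only after the whole block has been received, and the state is taken to hold the two most recent input bits with the newest bit in the MSB. All function names are hypothetical.

```python
G = (0b111, 0b101)   # generators of the (2, 1, 2) example code
NSTATES = 4


def step(state, bit):
    """Return (next_state, code symbols) for one input bit."""
    reg = (bit << 2) | state                      # newest bit in the MSB
    return reg >> 1, tuple(bin(reg & g).count("1") & 1 for g in G)


def encode(bits):
    state, out = 0, []
    for b in bits:
        state, sym = step(state, b)
        out.append(sym)
    return out


def viterbi_decode(received):
    inf = float("inf")
    pm = [0.0] + [inf] * (NSTATES - 1)            # encoder starts in state 0
    survivor = []                                 # decision bits per time step
    for r in received:
        new_pm, decisions = [], []
        for ns in range(NSTATES):
            bit = ns >> 1                         # input bit leading into ns
            cands = []
            for d in (0, 1):                      # d = bit shifted out of the predecessor
                s = ((ns << 1) & 0b11) | d
                _, sym = step(s, bit)
                cands.append(pm[s] + sum(a != b for a, b in zip(sym, r)))
            d_best = 0 if cands[0] <= cands[1] else 1   # compare and select
            new_pm.append(cands[d_best])
            decisions.append(d_best)
        pm = new_pm
        survivor.append(decisions)
    # Trace back: shift the state left and insert the stored decision bit.
    state = min(range(NSTATES), key=lambda s: pm[s])
    bits = []
    for decisions in reversed(survivor):
        bits.append(state >> 1)
        state = ((state << 1) & 0b11) | decisions[state]
    return bits[::-1]


if __name__ == "__main__":
    msg = [1, 0, 1, 1, 0, 0, 1, 0, 0]             # two tail zeros terminate the trellis
    rx = encode(msg)
    rx[3] = (rx[3][0] ^ 1, rx[3][1])              # inject one channel error
    print(viterbi_decode(rx) == msg)
```

Storing one decision bit per state and time step, rather than whole paths, is what keeps the survivor memory small. The trace-back loop rebuilds each predecessor state from the current state and that single bit.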

(37) Fig. 3.5 The simulation result of Viterbi decoder under different trace-back lengths. [Plot: "Performance of Viterbi Decoder under different trace-back length (16-level soft-input, QPSK, Code Rate=1/2)"; BER versus SNR (0 to 5); curves for tblen = 32, 64, and 96.]

3.4. Summary

In this chapter, we have briefly described the convolutional code and presented the Viterbi algorithm, which is widely applied in convolutional decoding. For power-saving reasons, we use the trace-back method instead of the register-exchange method to manage the storage of the survivor sequences. A suitable trace-back length is chosen according to the simulation result shown in Fig. 3.5. This reduces the memory requirement with little performance degradation.

(38) Chapter 4 Fixed Point Analysis of Dual Mode Turbo/Viterbi Decoder

Turbo decoding uses SISO decoders to achieve a fairly good coding performance. In the previous chapters, we analyzed several factors that affect the behavior of the turbo code by supplying floating point values as the required soft input. In a practical implementation, however, all soft inputs must be bounded, since infinite precision cannot be achieved. In general, coding performance suffers a quantization loss due to the limitation of the internal bit-width, so the trade-off between hardware cost and performance must be considered before chip implementation. For the turbo decoder, the Max-Log-MAP decoder is adopted as our SISO decoder because channel characteristics are not available in our 3GPP2 simulation platform. The following fixed point analysis is therefore based on the Max-Log-MAP algorithm. For the Viterbi decoder, a sixteen-level soft-input precision is proposed, and the corresponding bit-width for path metric computation is also discussed. In this chapter, we present a solution that minimizes the hardware cost without critical performance degradation for the turbo/Viterbi decoder in the 3GPP2 standard.

4.1. Fixed Point Analysis for Turbo Decoder

4.1.1. Input LLR Representation

Quantization of the input LLR directly influences not only the performance of turbo decoding but also the memory requirement of the design. The reason is that, since the

(39) received systematic symbols must be interleaved for the second SISO decoder, the memory size is directly proportional to the bit-width of the systematic symbol. To determine an acceptable finite precision for the input LLR, several simulations over the AWGN channel with BPSK and 16-QAM have been performed using a floating point turbo decoder with quantized input LLRs. Note that the channel reliability $L_c$ can be assumed to be 1 in a practical implementation without performance degradation. Hence, the received symbol vector introduced in section 2.2 can be regarded directly as the input LLR. Fig. 4.1 plots the quantization loss of the bounded input symbols with BPSK modulation and rate 1/2 turbo decoding. Fig. 4.2 shows the same simulation with 16-QAM modulation and rate 1/5 turbo decoding. The former case is used when data and signaling are transmitted from a mobile station to a base station, and the latter case is used when data are sent in the opposite direction. The dotted line is a rough threshold corresponding to a 5% frame error rate (FER), which is the target FER specified in the 3GPP2 standard. Note that a.b in these figures denotes the quantization scheme in which a is the number of bits used for the integer part and b is the number of bits used for the fractional part. The simulation results show that the 3.3 scheme performs slightly worse than the 4.2 scheme in Fig. 4.2. Nevertheless, it performs better than the other schemes in Fig. 4.1, and it is therefore recommended for the Max-Log-MAP decoder.
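The a.b notation can be made concrete with a small Python sketch. It assumes, for illustration only, that the a integer bits include the sign (two's complement), so that a 3.3 value occupies 6 bits and covers the range from -4 to 3.875 in steps of 0.125. The function name `quantize` and the round-to-nearest rule are likewise assumptions of this example.

```python
import math


def quantize(x, int_bits, frac_bits):
    """Map a real value to the nearest a.b fixed point value, with clipping.

    Assumption: int_bits counts the sign bit, so the representable range is
    [-2**(int_bits - 1), 2**(int_bits - 1) - 2**-frac_bits].
    """
    scale = 1 << frac_bits
    lo = -(1 << (int_bits - 1)) * scale        # most negative integer code
    hi = (1 << (int_bits - 1)) * scale - 1     # most positive integer code
    code = math.floor(x * scale + 0.5)         # round to nearest step
    code = max(lo, min(hi, code))              # clip values outside the range
    return code / scale


if __name__ == "__main__":
    for value in (0.3, 1.3, 10.0, -10.0):
        print(value, quantize(value, 3, 3), quantize(value, 4, 2))
```

With the same total of 6 bits, the 3.3 format trades half of the range of the 4.2 format for twice the resolution, which is the trade-off the two figures compare.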

(40) Fig. 4.1 Fixed point simulation result of the input symbols with BPSK modulation and rate 1/2 turbo decoding. [Plot: "Performance of Turbo Decoder under different fixed point input format (N=20730, BPSK, Code Rate=1/2, 6 iterations)"; BER versus SNR (-2.25 to -1); curves for floating point and the 3.3, 4.2, and 3.2 formats.]

Fig. 4.2 Fixed point simulation result of the input symbols with 16-QAM modulation and rate 1/5 turbo decoding. [Plot: "Performance of Turbo Decoder under different fixed point input format (N=20730, 16-QAM, Code Rate=1/5, 6 iterations)"; BER versus SNR (-1 to 4); curves for floating point and the 3.3, 3.2, and 4.2 formats.]

(41) 4.1.2. Extrinsic Data Representation

Extrinsic data provides the information that turbo codes need to perform iterative decoding. Quantization of the extrinsic information must be done carefully, or the benefit of iterative decoding will be reduced. Strictly speaking, an 8-bit integer part is necessary in our proposed design to avoid overflow, as will be explained in section 4.1.3. However, this is a heavy burden for a hardware implementation: the extrinsic data must be interleaved or de-interleaved to become the a priori information of the other SISO decoder, so a large memory is required. A fixed point simulation of the extrinsic data is therefore performed with cost in mind, as shown in Fig. 4.3. Note that any extrinsic value exceeding the representable range is saturated to the nearest representable value instead of being truncated directly. This ensures that no unnecessary performance degradation occurs. The idea comes from the fact that the extrinsic data is small on average when the channel noise is high, and large when the channel provides good transmission quality. In the former case, the values are small, so the bit-width can safely be reduced. In the latter case, since the transmission quality is good, the extrinsic information only needs to be large enough, and clipping it has little influence on the error correction ability of iterative decoding. The bit-width can therefore be reduced in both cases. We can see that with four or more integer bits for the extrinsic data, the performance is close to the floating point simulation result. The 4.2 scheme is therefore the best choice.
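The difference between saturating an out-of-range value and simply dropping its upper bits can be seen in a short Python sketch. The helper names `saturate` and `wrap` are hypothetical. The sketch assumes a 6-bit two's complement code for the 4.2 format, with two fractional bits so that the integer code is four times the value.

```python
def saturate(code, total_bits):
    """Clip an integer code to the two's complement range of total_bits."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, code))


def wrap(code, total_bits):
    """Keep only the low total_bits bits, as direct truncation of the MSBs does."""
    code &= (1 << total_bits) - 1
    if code >> (total_bits - 1):           # sign bit set: value is negative
        code -= 1 << total_bits
    return code


if __name__ == "__main__":
    code = int(9.0 * 4)                    # extrinsic value 9.0 in units of 1/4
    print(saturate(code, 6) / 4)           # stays large and positive
    print(wrap(code, 6) / 4)               # sign flips
```

Under these assumptions a strongly positive extrinsic value remains strongly positive after saturation, whereas truncation of the upper bits turns it into a strongly negative one, which would mislead the next SISO decoder.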

(42) Fig. 4.3 Fixed point simulation result of the extrinsic information. [Plot: "Performance of Turbo Decoder under different extrinsic data bounds (N=20730, 16-QAM, Code Rate=1/5, 6 iterations)"; BER versus SNR (-1 to 4); curves for floating point and the 5.2, 4.2, and 3.2 formats.]

4.1.3. Bit width of Internal Variables

In the previous sections, we determined the range of all input information, including the input LLRs and the a priori data that comes from the extrinsic data of the other SISO decoder. From these bounded inputs, the bit width of the internal variables $\gamma_k$, $\alpha_k$, and $\beta_k$ for Max-Log-MAP decoding can be derived. First, the bound of $\gamma_k$ is determined by (2.14) and (2.17). Given an integer range of $B_{in}$ for the input LLRs and $B_{Lex}$ for the a priori data, the maximum difference of $\gamma_k$, denoted $\Delta\gamma_k$, is bounded by

\[
\Delta\gamma_k \le n \times B_{in} + B_{Lex}. \tag{4.1}
\]

In the 3GPP2 standard, the code rate of the component encoder is 1/3, and thus the value n in (2.14) is 3. For $B_{in} = 8$ (3-bit integer) and $B_{Lex} = 16$ (4-bit integer), the maximum difference of $\gamma_k$ in our design is 40, and thus 6 integer bits are required.
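As a worked instance of (4.1) with the values quoted above (n = 3, $B_{in} = 8$, $B_{Lex} = 16$), the arithmetic is:

```latex
\[
\Delta\gamma_k \le n \times B_{in} + B_{Lex} = 3 \times 8 + 16 = 40,
\qquad 2^{5} = 32 < 40 \le 64 = 2^{6}.
\]
```

Since 40 lies between $2^5$ and $2^6$, six integer bits are the smallest width that covers this range, which agrees with the 6-bit figure stated in the text.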

(43) Although $\alpha_k$ and $\beta_k$ are computed recursively, the range of the differences between path metrics remains bounded, as stated by the following theorem from [17] and [18].

Theorem: For an RSC encoder with m shift registers and a maximum Hamming distance of $d_m$ between any two paths across m trellis sections, the difference of path metrics is bounded by

\[
\Delta\alpha_k \le m \times B_{Lex} + d_m \times B_{in} \tag{4.2}
\]

and

\[
\Delta\beta_k \le m \times B_{Lex} + d_m \times B_{in}. \tag{4.3}
\]

The theorem relies on the fact that the probabilities of any two states at time index k originate from the same set of states at time index k-m. The difference of any two path metrics at time index k therefore depends only on the branch metrics from time index k-m to k. Fig. 4.4 illustrates the paths along which the metrics propagate for m=3. Let $S_{max}$ be the state with the maximum path metric at time index k, and $S_{min}$ the state with the minimum path metric at time index k. Then the bound can be expressed as

\[
\alpha(S_{max}) - \alpha(S_{min}) = M_{max} - M_{min}, \tag{4.4}
\]

where $M_{max}$ and $M_{min}$ are the accumulated branch metrics from time index k-m to k for $S_{max}$ and $S_{min}$ respectively. The bound is thus determined by the maximum difference of branch metrics within m trellis sections. For the 3GPP2 standard, these two paths are labeled in Fig. 4.4. For $B_{in} = 8$ and $B_{Lex} = 16$, the corresponding bound is 96, and the bit-width required for its integer part is 7.
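The figure of 96 can be checked against (4.2). The text does not state $d_m$ explicitly here, so the value $d_m = 6$ below is back-solved from the quoted bound with m = 3 and should be read as an inference rather than a value taken from the source:

```latex
\[
\Delta\alpha_k \le m \times B_{Lex} + d_m \times B_{in}
= 3 \times 16 + 6 \times 8 = 48 + 48 = 96,
\qquad 2^{6} = 64 < 96 \le 128 = 2^{7}.
\]
```

The same arithmetic applies to $\Delta\beta_k$ in (4.3), and the position of 96 between $2^6$ and $2^7$ matches the 7-bit integer part given in the text.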

(44) Fig. 4.4 An eight-state trellis diagram illustrating message passing within 3 trellis sections. [Diagram: trellis from time index k-3 to k, starting from state 000, with the two paths that lead to $S_{min}$ and $S_{max}$ labeled by their systematic, 1st parity, and 2nd parity bits.]

After evaluating the bounds of $\gamma_k$, $\alpha_k$, and $\beta_k$, the bound on the magnitude of the output LLR, $L(\hat{u}_k)$, can be derived with the following theorem from [17] and [18].

Theorem: Given $\Delta\alpha_k$ as the bound of the difference of the forward path metric, and $\Delta\gamma_k$ as the bound of the difference of the branch metric, the magnitude of the output LLR is bounded by

\[
|L(\hat{u}_k)| \le \Delta\alpha_k + \Delta\gamma_k. \tag{4.5}
\]

(45) The theorem follows directly from the definition of $L(\hat{u}_k)$ in (2.15). Replace $\alpha_{k-1}(s_{k-1})$ and $\gamma_k(s_{k-1}, s_k)$ in the numerator of (2.15) by $\max(\alpha_{k-1})$ and $\max(\gamma_k)$ respectively, and replace $\alpha_{k-1}(s_{k-1})$ and $\gamma_k(s_{k-1}, s_k)$ in the denominator of (2.15) by $\min(\alpha_{k-1})$ and $\min(\gamma_k)$ respectively. Then (2.15) can be extended to

\[
L(\hat{u}_k) \le \ln \frac{\max \alpha_{k-1}(S_{k-1}) \cdot \max \gamma_k(S_{k-1}, S_k) \cdot \sum_{S_k,\, u_k = +1} \beta_k(S_k)}{\min \alpha_{k-1}(S_{k-1}) \cdot \min \gamma_k(S_{k-1}, S_k) \cdot \sum_{S_k,\, u_k = -1} \beta_k(S_k)}. \tag{4.6}
\]

Since each state at time index k is entered by two branches, exactly one of which corresponds to an information bit of 0 and the other to an information bit of 1, we obtain

\[
\sum_{S_k,\, u_k = +1} \beta_k(S_k) = \sum_{S_k,\, u_k = -1} \beta_k(S_k), \tag{4.7}
\]

and (4.6) can be further simplified to

\[
L(\hat{u}_k) \le \Delta\alpha_{k-1} + \Delta\gamma_k. \tag{4.8}
\]

Similarly, by replacing $\alpha_{k-1}(s_{k-1})$ and $\gamma_k(s_{k-1}, s_k)$ in the numerator of (2.15) by $\min(\alpha_{k-1})$ and $\min(\gamma_k)$ respectively, and those in the denominator of (2.15) by $\max(\alpha_{k-1})$ and $\max(\gamma_k)$ respectively, the lower bound of $L(\hat{u}_k)$ is obtained as

\[
L(\hat{u}_k) \ge -\Delta\alpha_{k-1} - \Delta\gamma_k. \tag{4.9}
\]

Theoretically, the range of $L(\hat{u}_k)$ in our proposed design is between -136 and +136. However, the probability of $|L(\hat{u}_k)| \ge 128$ is extremely small. For cost reasons, we can use only 8 bits to represent the integer part of $L(\hat{u}_k)$ with little performance degradation. Finally, the bound on the magnitude of the extrinsic information $L_{ex}(u_k)$ can be derived from (2.15) and formulated as

\[
|L_{ex}(u_k)| \le |L(\hat{u}_k)| - |L_c r_{k,1}| - |L(u_k)|. \tag{4.10}
\]

In our case, the corresponding bound is $|L_{ex}(u_k)| \le 124$. Hence, the suitable bit-width

(46) for $L_{ex}(u_k)$ is the same as that of $L(\hat{u}_k)$.

4.1.4. Performance under Fixed Point Simulation

After confirming the bound of each internal variable, a simulation aimed at reducing the cost of the path metric storage is performed, as shown in Fig. 4.5. The performance of the 6.2 scheme matches that of the 7.2 scheme, which corresponds to the upper bound of the path metric. This result is not surprising. The extreme condition discussed in section 4.1.3 occurs only if all received input symbols, including the a priori information, simultaneously reach relatively large values across three successive trellis sections. Such an event is seldom observed. Consequently, the 6.2 scheme is chosen in our design. Finally, we simulate the performance of the whole turbo decoder under complete fixed point conditions. Fig. 4.6 shows that the design loss is only about 0.25 dB compared with the floating point simulation result in the AWGN channel.

Fig. 4.5 Fixed point analysis with different bit-width of path metrics in turbo decoding. [Plot: "Performance of Turbo Decoder under different PM bit-width (N=20730, 16-QAM, Code Rate=1/5, 6 iterations)"; BER versus SNR (-1 to 4); curves for the 7.2, 6.2, and 5.2 formats.]

(47) Fig. 4.6 Design loss of fixed point turbo decoder. [Plot: "Performance comparison between floating point and fixed point scheme (N=20730, 16-QAM, Code Rate=1/5, 6 iterations)"; BER versus SNR (-1 to 4); curves for the floating point scheme and the fixed point scheme.]

4.2. Fixed Point Analysis of Viterbi Decoder

A soft decision Viterbi decoder provides better error correction capability. As the number of quantization levels increases, the error probability is further reduced at the price of linearly increasing complexity. However, the improvement saturates once the number of quantization levels reaches a threshold. A simulation that evaluates the performance improvement for different quantization levels is shown in Fig. 4.7. All schemes use uniform quantization with the optimal step size. The BER curve with 128-level soft input is taken as the performance limit of code rate 1/2, 256-state Viterbi decoding. As we can see, the improvement from the 8-level to the 16-level soft-input scheme is up to 0.4 dB. Nevertheless, the 32-level scheme gains only about 0.2 dB over the 16-level scheme.

(48) Hence we can conclude that the 16-level soft decision yields a good trade-off between performance and complexity, and it is therefore chosen for the proposed design.

Fig. 4.7 The performance of Viterbi decoder with different quantization levels. [Plot: "Performance of Viterbi Decoder under different soft-input level (QPSK, Code Rate=1/2, tblen=64)"; BER versus SNR (0 to 5); curves for 8, 16, 32, and 128 levels.]

Bounding the difference between the path metrics of any two states at any time is also a critical issue for Viterbi decoding. The Viterbi algorithm assumes that all survivor paths converge to the same node within the truncation length L. This assumption is illustrated in Fig. 4.8. The principle for choosing the truncation length was introduced in section 3.3. Computing all path metrics from t=0, we get

\[
\Gamma_k(S_1) = \Gamma_1 + \Gamma_{k-L}, \tag{4.11}
\]

\[
\Gamma_k(S_2) = \Gamma_2 + \Gamma_{k-L}. \tag{4.12}
\]

Then, the difference of any two path metrics can be written as

(49)

\[
|\Gamma_k(S_1) - \Gamma_k(S_2)| = |\Gamma_1 - \Gamma_2| \le BL, \tag{4.13}
\]

where B denotes the maximum value of the branch metric.

Fig. 4.8 The convergence of any two survivor paths in Viterbi algorithm. [Diagram: a common path with metric $\Gamma_{k-L}$ from t=0 to t=k-L, followed by two diverging paths of length L with metrics $\Gamma_1$ and $\Gamma_2$ that end at t=k in $\Gamma_k(S_1)$ and $\Gamma_k(S_2)$.]

In the 3GPP2 standard, the minimum code rate is 1/6. Combined with the 16-level soft input, this gives B = 90. The upper bound of the path metric in our proposed design is therefore 5760, which means that at most 13 bits are theoretically required. However, this case rarely happens. Since the bit-width of the path metric directly determines the size of its storage, a simulation that examines the cost of path metric storage was performed, and the result is shown in Fig. 4.9. It indicates that obvious performance degradation occurs when the 9-bit scheme is adopted. We therefore use 10 bits for the path metric representation in the Viterbi decoder, and the corresponding storage requirement is at least 2560 bits. Finally, the system performance for each supported code rate is summarized in Fig. 4.10.
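The numbers in this paragraph can be reproduced with a short Python sketch. Two values are assumptions back-solved from the text rather than stated in it at this point: the truncation length L = 64 (obtained from 5760 / 90, and equal to the tblen=64 used in the simulations) and a per-symbol maximum metric of 15 for the 16-level input. The state count of 256 follows from the constraint length of 9. The function name is hypothetical.

```python
def viterbi_pm_budget(levels=16, symbols_per_bit=6, trunc_len=64,
                      constraint_len=9, pm_bits=10):
    """Return the path metric bound, its bit-width, and the storage size."""
    max_bm = symbols_per_bit * (levels - 1)      # B: all symbols at maximum distance
    pm_bound = max_bm * trunc_len                # B * L from (4.13)
    bound_bits = pm_bound.bit_length()           # bits needed to hold the bound
    num_states = 1 << (constraint_len - 1)       # one path metric per state
    storage_bits = num_states * pm_bits          # storage with the chosen width
    return max_bm, pm_bound, bound_bits, storage_bits


if __name__ == "__main__":
    print(viterbi_pm_budget())
```

Under these assumptions the sketch returns B = 90, a bound of 5760 that needs 13 bits, and 2560 bits of storage for the 10-bit path metrics, matching the figures quoted above.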

(50) Fig. 4.9 The performance of Viterbi decoder under different bit-widths of path metric. [Plot: "Performance of Viterbi Decoder under different PM bit-width (16-level soft-input, QPSK, Code Rate=1/6, tblen=64)"; BER versus SNR (-5 to -1); curves for PM bit-width = 9, 10, 11, and 13.]

Fig. 4.10 The performance analysis on system performance for each kind of code rate. [Plot: "Overall performance of Viterbi Decoder under different code rate (QPSK)"; BER versus SNR (-5 to 5); curves for code rates 1/2, 1/3, 1/4, and 1/6.]

(51) 4.3. Summary

In this chapter, fixed point performance analysis of the internal variables is carried out for both operating modes. The bit-width decisions for turbo mode and Viterbi mode are summarized in Table 4.1 and Table 4.2 respectively. It is verified that the design loss is only about 0.25 dB for both turbo mode and Viterbi mode. A comparison with a similar work [19] for turbo mode is made in Table 4.3. It shows that the results are nearly the same.

Table 4.1 Summary of bit-width decision for turbo mode

Variables    Lc·r(k,v)    L(u_k)     Δγ         Δα         Δβ         L(û_k)        Lex(u_k)
Bounds       −∞ ~ +∞      −8 ~ +8    40         96         96         −136 ~ 136    −124 ~ 124
Bit-width    6 (3.3)      6 (4.2)    8 (6.2)    8 (6.2)    8 (6.2)    9 (7.2)       9 (7.2)

Table 4.2 Summary of bit-width decision for Viterbi mode

Variables    soft input    ΔBM       ΔPM
Bounds       0 ~ 15        0 ~ 90    0 ~ 5760
Bit-width    4             7         10

Table 4.3 A comparison of bit-width decision with [19] for turbo mode

Variables    Lc·r(k,v)    L(u_k)     Δγ         Δα         Δβ         L(û_k)     Lex(u_k)
Our work     6 (3.3)      6 (4.2)    8 (6.2)    8 (6.2)    8 (6.2)    9 (7.2)    9 (7.2)
[19]         5 (3.2)      6 (4.2)    5*         6*         6*         8*         8*

* Required bits for the integer part only.
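The two output-side bounds in Table 4.1 follow from the earlier ones by simple arithmetic. In the check below, the magnitudes 4 and 8 for $L_c r_{k,1}$ and $L(u_k)$ are assumptions: they correspond to half of the ranges $B_{in} = 8$ and $B_{Lex} = 16$, and they reproduce the tabulated value.

```latex
\[
|L(\hat{u}_k)| \le \Delta\alpha + \Delta\gamma = 96 + 40 = 136,
\qquad
|L_{ex}(u_k)| \le 136 - 4 - 8 = 124.
\]
```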

(52) Chapter 5 Architecture of Proposed Dual Mode Turbo/Viterbi Decoder

5.1. Architecture of Integrated Turbo/Viterbi Decoder

Because both decoders operate on a trellis, combining them allows the ACS units and the memory to be shared, leading to a much more compact architecture for the 3GPP2 system. The proposed architecture of the integrated turbo/Viterbi decoder is shown in Fig. 5.1, in which the shared components are drawn as gray blocks. A dedicated input selects the operating mode of the proposed design. While the turbo mode is active, the components used only in Viterbi mode are all disabled by clock gating, and vice versa. This ensures that redundant power consumption is avoided in both operating modes. Depending on the operating mode, the input data goes through the input cache in turbo mode, or through the transition metric unit (TMU, also called the branch metric unit or simply BMU) of the Viterbi decoder (VD) in Viterbi mode. In turbo mode, the sliding windowed approach introduced in section 2.3 is adopted with a sub-block length of 20. The data output of the input cache then passes through three additional TMUs for data preparation. The overall architecture contains 24 ACS units, which are separated into 3 blocks to complete the α, β1, and β2 recursions in parallel in turbo mode. In Viterbi mode, only 16 of the 24 ACS units are used for trellis decoding. The path metrics in both algorithms are obtained by accumulating branch metrics. Finally, the data output of the ACS units is passed either to the LLR

(53) computation unit for iterative decoding in turbo mode, or into the path metric unit (PMU) of the VD so that trace-back can be done periodically in Viterbi mode. In Fig. 5.1, the memory occupies a significant area of our design. It includes the input cache, the forward path metric storage of the turbo decoder, and the interleaver/de-interleaver memory, which is shared with the survivor memory in Viterbi mode. To save chip area, time multiplexing is used to double the memory access frequency, so that all memory blocks except the input cache are implemented with single-port SRAM.

[Figure: block diagram. Blocks: cache controller; input cache; TMU(α), TMU(β1), TMU(β2); ACS(α), ACS(β1), ACS(β2); SRAM(α); LLR unit; interleaver address generator; TD LIFO; VD LIFO; VD TMU; VD PMU; SRAM (TD: systematic symbols, VD: survivor memory); SRAM (TD: extrinsic symbols, VD: survivor memory).]
Fig. 5.1 The proposed architecture of integrated turbo/Viterbi decoder
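The add-compare-select operation that both modes share can be sketched as follows. This is a minimal behavioral model, not the thesis's hardware: the function names `acs` and `acs_step`, the two-branch-per-state trellis, and the convention that a larger metric is better are all assumptions made for illustration. Any correction term that a log-MAP recursion might add is omitted.

```python
def acs(pm_a, bm_a, pm_b, bm_b):
    """Add-compare-select for one state with two incoming branches.

    Returns (survivor metric, decision bit); decision 0 selects branch a.
    Assumption: the larger accumulated metric wins.
    """
    cand_a = pm_a + bm_a          # add
    cand_b = pm_b + bm_b
    if cand_a >= cand_b:          # compare
        return cand_a, 0          # select
    return cand_b, 1

def acs_step(path_metrics, branch_metrics, predecessors):
    """Apply one trellis step to every state.

    predecessors[s]   = (state_a, state_b) feeding state s
    branch_metrics[s] = (bm_a, bm_b) for those two branches
    """
    new_pm, decisions = [], []
    for s, (sa, sb) in enumerate(predecessors):
        bm_a, bm_b = branch_metrics[s]
        metric, bit = acs(path_metrics[sa], bm_a, path_metrics[sb], bm_b)
        new_pm.append(metric)
        decisions.append(bit)
    return new_pm, decisions

# Toy two-state example
pm, dec = acs_step([0, 1], [(2, 0), (0, 3)], [(0, 1), (0, 1)])
print(pm, dec)
```

In a Viterbi decoder the decision bits would be written to the survivor memory, while in a MAP recursion only the updated metrics are needed; that difference is what lets the same unit serve both modes.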

(54) 5.2. Architecture of Turbo Decoder
The architecture of the integrated turbo/Viterbi decoder operating in turbo mode is shown in Fig. 5.2, in which all disabled blocks and unconnected lines are represented by dotted lines. Although iterative decoding with ten iterations provides 0.2dB coding gain compared with six iterations, the former scheme is not adopted in our design because of its longer output latency and higher power dissipation. The detailed operating flow is described as follows.

[Figure: same block diagram as Fig. 5.1; blocks and lines not used in turbo mode are drawn dotted.]
Fig. 5.2 The architecture of integrated turbo/Viterbi decoder in turbo mode

(55) 5.2.1. Single MAP Decoder design
In general, the block diagram of a turbo decoder can be drawn as in Fig. 2.4, which consists of two MAP decoders, two interleavers, and one de-interleaver. Implementing the turbo decoder directly from this diagram is complicated and inefficient. Since the two constituent decoders are identical, a single MAP decoder is used, which both reduces design cost and simplifies the control logic for the two SISO decoders.

To realize the single MAP decoder architecture, a full decoding iteration is split into two phases. In the first phase, acting as SISO decoder 1, the MAP decoder reads the systematic data, the parity data, and the extrinsic values that come from the other decoder after de-interleaving. The output extrinsic data are stored in memory. In the second phase, acting as SISO decoder 2, the MAP decoder processes the permuted systematic data, the parity data from the second encoder, and the a priori values, which are the interleaved extrinsic output of SISO decoder 1. The resulting simplification of Fig. 2.4 is illustrated in Fig. 5.3. Note that there is an additional input cache and only one memory block for extrinsic data storage. These are introduced in Sections 5.2.2 and 5.2.6, respectively.

[Figure: block diagram. Labels: inputs r0, r1, r2; input cache (60x24); MAP decoder; two SRAMs (20730x6 each); interleaver address generator; signals Lex(u), L̃ex(u), and L̂(u).]
Fig. 5.3 A single MAP decoder architecture for turbo decoding
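The two-phase schedule described above can be sketched in software. This is only an illustration of the control flow: the `map_decode(systematic, parity, a_priori)` interface, the name `turbo_schedule`, and the convention that `perm[i]` is the natural-order index read at interleaved position `i` are assumptions, and the MAP algorithm itself is supplied by the caller.

```python
def turbo_schedule(map_decode, sys, par1, par2, perm, n_iter=6):
    """Run n_iter full iterations on one shared MAP decoder.

    map_decode: hypothetical callable returning a list of extrinsic values
    perm[i]:    natural-order index read at interleaved position i (assumed)
    """
    n = len(sys)
    ext = [0.0] * n                      # single extrinsic memory, natural order
    sys_perm = [sys[perm[i]] for i in range(n)]
    for _ in range(n_iter):
        # Phase 1 (SISO decoder 1): natural order, first parity stream
        ext = map_decode(sys, par1, ext)
        # Phase 2 (SISO decoder 2): interleaved order, second parity stream
        a_priori = [ext[perm[i]] for i in range(n)]
        ext_perm = map_decode(sys_perm, par2, a_priori)
        # De-interleave by writing back through the same addresses
        ext = [0.0] * n
        for i in range(n):
            ext[perm[i]] = ext_perm[i]
    return ext

# Stub decoder used only to exercise the schedule (not a real MAP decoder)
def stub(systematic, parity, a_priori):
    return [s + a for s, a in zip(systematic, a_priori)]

print(turbo_schedule(stub, [1, 2, 3], [0, 0, 0], [0, 0, 0], [2, 0, 1], n_iter=1))
```

Because interleaving and de-interleaving both go through the same address sequence, one extrinsic memory and one address generator are enough in this sketch, which matches the single memory block noted in the text.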

(56) 5.2.2. Cache design
As mentioned in Section 5.1, a sliding-window approach is adopted in our design. Referring to Fig. 2.7, the data of each sub-block must be read three times, once each by the ACS-β1, ACS-α, and ACS-β2 units. An input cache is therefore implemented to avoid repeated accesses to external memory, which also reduces power. The cache keeps three consecutive sub-blocks and provides one write port for data updating and three read ports for the ACS units. As shown in Fig. 5.4, the cache is implemented with a dual-port SRAM of 60x24 bits and is time-multiplexed at twice the clock rate to provide four data ports. A set of additional registers is employed at output port-2 to guarantee that all outputs of the cache are synchronized to the same rising clock edge. The detailed timing chart is shown in Fig. 5.5.

[Figure: dual-port 60×24 memory running at 2x clock rate, with input port-0 (I), Address-0 to Address-3, register D, and output port-1 to output port-3 (O0, O1, O2).]
Fig. 5.4 The input cache architecture
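The time-multiplexing idea can be modeled behaviorally: a two-port memory clocked twice per MAP-decoder cycle offers four accesses, used as one write and three reads. The sketch below is an assumption-laden model, not the actual RTL. The class name `InputCache`, the method `map_cycle`, the port-to-access assignment (taken from the labels of Fig. 5.5), and the choice of which read is held in a register are all illustrative.

```python
class InputCache:
    """Behavioral model of a 60-entry dual-port memory clocked at 2x."""

    DEPTH = 60  # three sub-blocks of 20 symbols each

    def __init__(self):
        self.mem = [None] * self.DEPTH

    def map_cycle(self, wr_addr, wr_data, addr_alpha, addr_beta1, addr_beta2):
        """One MAP-decoder clock cycle = two cache clock cycles."""
        # Cache cycle 1: port A writes new data, port B reads for alpha.
        # (Write-before-read on equal addresses is a modeling choice.)
        self.mem[wr_addr] = wr_data
        alpha_reg = self.mem[addr_alpha]   # held in a register until cycle 2
        # Cache cycle 2: port A reads for beta1, port B reads for beta2.
        beta1 = self.mem[addr_beta1]
        beta2 = self.mem[addr_beta2]
        # All three outputs appear together at the MAP clock edge.
        return alpha_reg, beta1, beta2

cache = InputCache()
for addr in range(3):
    cache.map_cycle(addr, f"sym{addr}", 0, 0, 0)
print(cache.map_cycle(3, "sym3", 0, 1, 2))
```

The register on the early read reflects the role of the additional output registers described in the text: without it, the three outputs would become valid in different cache cycles.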

(57) [Figure: timing of the MAP decoder clock and the cache clock. Port A alternates between WR data and RD β1; Port B alternates between RD α and RD β2; the outputs are the data for computing α, β1, and β2.]
Fig. 5.5 The detailed timing chart of the proposed input cache

5.2.3. Transition Metric Unit (TMU)
In the 3GPP2 standard, eight branch metrics are required for LLR computation. According to (2.14), each branch metric is obtained by adding or subtracting the received symbols and the a priori data, depending on the branch codeword. Implementing this directly consumes many adders and subtractors. A simple way to overcome this problem is to use the equivalent formula (5.1):

$$\gamma_k(s_{k-1}, s_k) = \bigl(L_c r_{k,1} + L(u_k)\bigr)\cdot u_k + \sum_{v=2}^{n} L_c r_{k,v}\cdot x_{k,v} \tag{5.1}$$

where $u_k \in \{0, 1\}$. Terms multiplied by $u_k = 0$ vanish, so some terms can be removed without changing the differences between branch metrics. The modified architecture of the TMU is shown in Fig. 5.6.
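Formula (5.1) can be evaluated directly for every branch label. The sketch below assumes n = 3 (one systematic and two parity symbols per trellis step, which yields the eight metrics mentioned above) and parity bits $x_{k,v} \in \{0, 1\}$; the function names and the list layout of `lc_r` are illustrative, not taken from the thesis.

```python
from itertools import product

def branch_metric(lc_r, l_apriori, u, x):
    """Evaluate (5.1) for one branch.

    lc_r:      [Lc*r_{k,1}, ..., Lc*r_{k,n}], scaled received symbols
    l_apriori: L(u_k), a priori value of the systematic bit
    u:         systematic bit, 0 or 1
    x:         parity bits (x_{k,2}, ..., x_{k,n}), each 0 or 1 (assumed)
    """
    gamma = (lc_r[0] + l_apriori) * u
    for lc_rv, xv in zip(lc_r[1:], x):
        gamma += lc_rv * xv
    return gamma

def all_branch_metrics(lc_r, l_apriori):
    """Return a dict mapping (u, x) to the branch metric for every label."""
    n_parity = len(lc_r) - 1
    return {
        (u, x): branch_metric(lc_r, l_apriori, u, x)
        for u in (0, 1)
        for x in product((0, 1), repeat=n_parity)
    }

metrics = all_branch_metrics([1.0, 2.0, -0.5], 0.5)
for label, value in sorted(metrics.items()):
    print(label, value)
```

With bits in {0, 1}, the all-zero label costs nothing and every other metric is a sum of at most n precomputed terms, which is the saving in adders and subtractors that the text refers to.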