An Area-Efficient Double-Binary CTC Decoder for WiMAX Applications

National Chiao Tung University

Department of Electronics Engineering and Institute of Electronics

Master's Thesis

An Area-Efficient Double-Binary CTC Decoder

for WiMAX Applications

Student: Ming-Chih Hu

An Area-Efficient Double-Binary CTC Decoder

for WiMAX Applications

Student: Ming-Chih Hu

Advisor: Prof. Chen-Yi Lee

A Thesis

Submitted to the Department of Electronics Engineering and Institute of Electronics, College of Electrical and Computer Engineering,

National Chiao Tung University, in Partial Fulfillment of the Requirements

for the Degree of Master of Science

in

Electronics Engineering

July 2008

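Abstract (English translation of the Chinese abstract)

This thesis introduces the double-binary convolutional turbo decoding algorithm and proposes an area-efficient decoder that is fully compliant with the WiMAX protocol. The proposed decoder supports all code lengths defined in the IEEE 802.16e specification. By scaling the extrinsic information, the Max-Log-MAP algorithm reduces hardware complexity with very little performance loss. In addition, the proposed pseudo two-port register file greatly reduces memory usage and decoding latency, and allows data to be read and written within the same decoding cycle. Furthermore, the proposed simplified interleaver uses only simple adders and subtractors, which greatly reduces hardware usage. According to the experimental results, the decoder achieves a maximum throughput of 30 Mb/s in a 90 nm process with a chip area of 1.12 mm². With a supply voltage of 0.9 V, an operating frequency of 166 MHz, and a code length of 2400, the measured power consumption is 32.87 mW. This thesis also presents a trellis-based decoding algorithm that applies the stochastic update rule. By using stochastic computation, the hardware complexity of the ACS unit can be greatly simplified. The proposed state memory increases the random switching activity to avoid locking into a fixed state, and the adopted noise-dependent scaling factor eliminates the error floor effect. Both techniques greatly improve the decoding performance. Compared with the (2, 1, 3) turbo code, the experimental results show that the stochastic decoder can serve as a decoding option with reduced hardware complexity.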

An Area-Efficient Double-Binary CTC Decoder

for WiMAX Applications

Student: Ming-Chih Hu

Advisor：Dr. Chen-Yi Lee

Department of Electronics Engineering

Institute of Electronics

National Chiao Tung University

ABSTRACT

The double-binary convolutional turbo code (CTC) decoding algorithm is introduced in this thesis, and a fully compliant and area-efficient CTC decoder for WiMAX 802.16e is proposed. The proposed decoder supports all code lengths specified in the IEEE 802.16e system. By scaling the extrinsic information, the Max-Log-MAP algorithm can be used to reduce hardware complexity with minimal performance loss. To save memory and reduce decoding latency, a pseudo two-port register file is also demonstrated that allows read and write operations within one decoding cycle. Moreover, a simplified interleaver architecture that uses simple addition and subtraction instead of division is proposed to reduce the hardware area and shorten the critical path. Implemented in a 90-nm process, the proposed decoder chip occupies a core area of 1.12 mm² and achieves a decoding throughput of 30 Mb/s. According to post-layout simulation, the power consumption is 32.87 mW at a supply voltage of 0.9 V and a clock rate of 166 MHz with a block length of 2400.

A trellis-based decoding algorithm that applies the stochastic update rule is also presented in this thesis. By using stochastic computation, the hardware complexity can be reduced by simplifying ACS-unit operation. The proposed state memory increases the random switching activity to prevent the decoder from locking into a fixed state, and a noise-dependent scaling factor further eliminates the error floor effect. Both techniques greatly improve the performance compared to the Viterbi decoding algorithm. Through simulation analysis and parameter selection for the (2, 1, 3) convolutional code, the performance comparison shows that the stochastic decoding algorithm can be one of the candidates for low-complexity iterative decoding.

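Acknowledgements

My two years in the master's program have passed in the blink of an eye. I thank my advisor, Prof. Chen-Yi Lee, for building the excellent research atmosphere and laboratory environment of Si2Lab and for always guiding me in the right direction with patience and kindness. Busy as he is, he never fails to take a warm interest in everyone's research progress, which lets us do research without pressure. I thank Prof. Hsie-Chia Chang (張錫嘉) for pointing me toward positive ways of thinking whenever my research ran into setbacks. I am glad to have met such a good teacher, who gave me great freedom to pursue the research I enjoy; he also offered much valuable advice during my job search, so that I did not feel lost and helpless. Special thanks go to my senior Chih-Lung Chen (陳志龍), who has looked after me from my freshman year at NCTU until now, as I leave NCTU at the end of my second year in the master's program. I remember Prof. Chang once saying, "Beyond the diploma you receive at graduation, there are many more habits and attitudes that need to be established during these years." I thank my senior for tirelessly teaching me the right attitude toward research; many of my habits and attitudes were formed during these two years. I also thank the OCEAN research team, led by my seniors Chien-Ching Lin (林建青) and Yen-Chin Liao (廖彥欽), for always giving me so much help and the confidence to solve the problems that kept coming.

Finally, I thank all the senior and junior members of Si2Lab and the OCEAN group, all the friends around me, and, of course, my family, who support me the most. Because of your company, I did not walk through these two years alone. Thank you.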

List of Figures

1.1 Block diagram of a typical digital communication system

2.1 Correction factor

2.2 The process of sliding window MAP algorithm

2.3 Turbo encoder

2.4 Turbo decoder

2.5 Double-binary Convolutional Turbo encoder

2.6 Double-binary CTC decoder

2.7 Multiplication of two stochastic sequences

2.8 Division of two stochastic sequences

2.9 Stochastic (scaled) addition

2.10 Converting channel probabilities to stochastic streams

2.11 An example of a constraint node: (a) a detailed trellis description of its constraint; (b) the set S corresponding to this constraint

2.12 Update rule for stochastic decoding algorithm

2.13 Trellis-based stochastic decoder: (a) with matched codeword; (b) with mismatched codeword

3.1 Performance of stochastic decoder

3.2 State memory with matched codeword

3.3 State memory with mismatched codeword

3.4 Double-side state memory application

3.5 Stochastic decoder with state memory usage

3.6 NDS factor comparison

3.8 1-stage trellis-based stochastic decoder architecture

4.1 CTC encoder for WiMAX standard

4.2 Trellis diagram of double-binary CTC

4.3 Comparison of iteration number and window size

4.4 Scaling factor comparison

4.5 Fixed point comparison

4.6 CTC decoder block diagram

4.7 MAP decoder block diagram

4.8 MAP decoding timing flow

4.9 Pseudo two-port register file

4.10 Interleaver architecture

List of Tables

4.1 Circulation state lookup table (Sc)

4.2 Interleaver function

4.3 Summary of fixed representation in turbo decoding

4.4 WiMAX CTC decoder chip summary

4.5 Comparison among WiMAX CTC decoders

A.1 CTC channel coding per modulation

A.2 CTC channel coding per modulation (cont.)

Chapter 1 Introduction

1.1 Research Motivation

The fundamental block diagram of a typical digital communication system is shown in Fig. 1.1. Signal transformation from the information source to the transmitter includes source encoding, channel encoding and modulation. The receiver will reverse the signal transformation by demodulation, channel decoding and source decoding. In order to eliminate the effects of noise disturbances, the channel encoder transforms the source codeword into the channel codeword by adding certain structural redundancy. These redundant bits can be used for detecting and correcting the errors. Theoretically, the encoding procedure provides the encoded signal with better distance properties than the un-coded one, and thus channel coding can improve the performance of the overall system.

Figure 1.1: Block diagram of a typical digital communication system (information source, source encoder, channel encoder, modulator, channel, demodulator, channel decoder, source decoder, information destination)

In the last decade, trellis-based decoding algorithms applied to convolutional codes or turbo codes have been adopted in many standards because of their excellent error-correcting ability. Although the turbo code has outstanding error-correcting performance, its decoding efficiency and maximum throughput still cannot meet the requirements of standards that demand higher throughput. Hence, the double-binary convolutional turbo code has been introduced in recent years because of its high decoding efficiency and excellent error-correcting performance. A double-binary CTC decoder for the WiMAX 802.16e standard [1], which specifies a maximum throughput of about 30 Mb/s, is proposed in this thesis, and its hardware architecture and chip implementation results are presented in Chapter 4.

The main problem of the double-binary convolutional turbo code is the higher hardware complexity of the high-radix ACS unit when moving from single-binary to double-binary (or even triple-binary) codes. In order to design a low-complexity trellis-based decoder, stochastic computation is applied to the trellis-based decoding algorithm. Stochastic arithmetic, introduced in the 1960s, can break the speed bottleneck caused by recursive computation and thus increase the operating frequency. In addition, the error-correcting performance can be adjusted through the number of decoding cycles. Since the stochastic decoding algorithm has been successfully applied to LDPC codes, it may also be applicable to convolutional codes; it is therefore introduced and applied to the trellis diagram in this thesis.

1.2 Thesis Organization

This thesis consists of five chapters. Chapter 2 reviews different kinds of trellis-based decoding algorithms, such as the single-binary turbo decoding algorithm, the double-binary turbo decoding algorithm, and the stochastic decoding algorithm using the update rule. Chapter 3 describes the stochastic decoder applied to the trellis-based decoding algorithm, together with further improvements and simulation analysis. Chapter 4 introduces the implementation of the double-binary convolutional turbo code for the WiMAX 802.16e system, including the performance comparison, the hardware architecture, and the chip implementation results. Finally, the conclusion is given in Chapter 5. The parameters used in WiMAX 802.16e are listed in Appendix A.

Chapter 2 Trellis-based Decoding Algorithms

2.1 MAP Decoding Algorithm

2.1.1 The MAP Decoding Algorithm

The maximum a posteriori probability (MAP) decoding algorithm, also known as the BCJR decoding algorithm, was developed by Bahl, Cocke, Jelinek, and Raviv in 1974 [2]. The MAP algorithm is optimal for estimating the states or the outputs of a Markov process observed over an AWGN channel. It produces the sequence of a posteriori probabilities (APP) from the received sequence $\mathbf{r}$ over a discrete memoryless channel (DMC) and minimizes the symbol error probability. For a state transition from $S_{m'}^{(t)}$ at time $t$ to $S_m^{(t+1)}$ at time $t+1$, the joint probability can be factored as

$$
\begin{aligned}
\Pr\{S_{m'}^{(t)}, S_m^{(t+1)}, \mathbf{r}\}
&= \Pr\{S_{m'}^{(t)}, S_m^{(t+1)}, r_0^{t-1}, r_t, r_{t+1}^{N-1}\} \\
&= \Pr\{r_{t+1}^{N-1} \mid S_{m'}^{(t)}, S_m^{(t+1)}, r_0^{t-1}, r_t\}
   \times \Pr\{S_m^{(t+1)}, r_t \mid S_{m'}^{(t)}, r_0^{t-1}\}
   \times \Pr\{S_{m'}^{(t)}, r_0^{t-1}\} \\
&= \Pr\{r_{t+1}^{N-1} \mid S_m^{(t+1)}\}\,
   \Pr\{S_m^{(t+1)}, r_t \mid S_{m'}^{(t)}\}\,
   \Pr\{S_{m'}^{(t)}, r_0^{t-1}\}.
\end{aligned}
\tag{2.1}
$$

The last equality follows from the Markov property of the encoder and the memoryless channel. Notice that $(m', m)$ denotes the state transition, $r_0^{t-1}$ denotes the received sequence from time $0$ to $t-1$, $r_{t+1}^{N-1}$ denotes the received sequence from time $t+1$ to $N-1$, and $r_t$ denotes the received codeword symbol at time $t$. We further define the three factors in (2.1) as

$$\alpha(S_{m'}^{(t)}) = \Pr\{S_{m'}^{(t)}, r_0^{t-1}\} \tag{2.2}$$

$$\gamma(S_{m'}^{(t)}, S_m^{(t+1)}) = \Pr\{S_m^{(t+1)}, r_t \mid S_{m'}^{(t)}\} \tag{2.3}$$

$$\beta(S_m^{(t+1)}) = \Pr\{r_{t+1}^{N-1} \mid S_m^{(t+1)}\}, \tag{2.4}$$

and thus (2.1) can be rewritten as

$$\Pr\{S_{m'}^{(t)}, S_m^{(t+1)}, \mathbf{r}\} = \alpha(S_{m'}^{(t)})\,\gamma(S_{m'}^{(t)}, S_m^{(t+1)})\,\beta(S_m^{(t+1)}). \tag{2.5}$$

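Now we derive the quantities in (2.2), (2.3), and (2.4) as follows:

$$
\begin{aligned}
\alpha(S_m^{(t+1)}) &= \Pr\{S_m^{(t+1)}, r_0^{t}\} \\
&= \sum_{S_{m'}^{(t)} \in S} \Pr\{S_{m'}^{(t)}, S_m^{(t+1)}, r_0^{t}\} \\
&= \sum_{S_{m'}^{(t)} \in S} \Pr\{S_m^{(t+1)}, r_t \mid S_{m'}^{(t)}, r_0^{t-1}\}\,\Pr\{S_{m'}^{(t)}, r_0^{t-1}\} \\
&= \sum_{S_{m'}^{(t)} \in S} \Pr\{S_m^{(t+1)}, r_t \mid S_{m'}^{(t)}\}\,\Pr\{S_{m'}^{(t)}, r_0^{t-1}\} \\
&= \sum_{S_{m'}^{(t)} \in S} \gamma(S_{m'}^{(t)}, S_m^{(t+1)})\,\alpha(S_{m'}^{(t)}).
\end{aligned}
\tag{2.6}
$$

Similarly,

$$
\begin{aligned}
\beta(S_{m'}^{(t)}) &= \sum_{S_m^{(t+1)} \in S} \Pr\{S_m^{(t+1)}, r_t^{N-1} \mid S_{m'}^{(t)}\} \\
&= \sum_{S_m^{(t+1)} \in S} \Pr\{S_m^{(t+1)}, r_t, r_{t+1}^{N-1}, S_{m'}^{(t)}\} \,/\, \Pr\{S_{m'}^{(t)}\} \\
&= \sum_{S_m^{(t+1)} \in S} \Pr\{r_{t+1}^{N-1} \mid S_m^{(t+1)}, r_t, S_{m'}^{(t)}\}\,\Pr\{S_m^{(t+1)}, r_t \mid S_{m'}^{(t)}\} \\
&= \sum_{S_m^{(t+1)} \in S} \Pr\{r_{t+1}^{N-1} \mid S_m^{(t+1)}\}\,\Pr\{S_m^{(t+1)}, r_t \mid S_{m'}^{(t)}\} \\
&= \sum_{S_m^{(t+1)} \in S} \beta(S_m^{(t+1)})\,\gamma(S_{m'}^{(t)}, S_m^{(t+1)}),
\end{aligned}
\tag{2.7}
$$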

where $S$ is the set of all states. From (2.6) and (2.7), the forward metric $\alpha$ and the backward metric $\beta$ are computed recursively in opposite directions. Assuming that the trellis diverges from the zero state at time $0$ and converges to the zero state at time $N$, the initial conditions are

$$
\begin{aligned}
\alpha(S_0^{(0)}) &= 1, & \alpha(S_x^{(0)}) &= 0 \ \text{ for } S_x^{(0)} \in S \setminus S_0, \\
\beta(S_0^{(N)}) &= 1, & \beta(S_x^{(N)}) &= 0 \ \text{ for } S_x^{(N)} \in S \setminus S_0.
\end{aligned}
\tag{2.8}
$$

Furthermore, the branch metric from state $m'$ to state $m$ can be computed as

$$
\begin{aligned}
\gamma(S_{m'}^{(t)}, S_m^{(t+1)})
&= \frac{\Pr\{S_m^{(t+1)}, S_{m'}^{(t)}, r_t\}}{\Pr\{S_{m'}^{(t)}\}} \\
&= \frac{\Pr\{S_m^{(t+1)}, S_{m'}^{(t)}\}}{\Pr\{S_{m'}^{(t)}\}} \times
   \frac{\Pr\{S_m^{(t+1)}, S_{m'}^{(t)}, r_t\}}{\Pr\{S_m^{(t+1)}, S_{m'}^{(t)}\}} \\
&= \Pr\{S_m^{(t+1)} \mid S_{m'}^{(t)}\}\,\Pr\{r_t \mid S_m^{(t+1)}, S_{m'}^{(t)}\} \\
&= P(u_t)\,P(r_t \mid \hat{v}_t),
\end{aligned}
\tag{2.9}
$$

where $u_t$ is the encoder input that causes the transition $S_{m'}^{(t)} \to S_m^{(t+1)}$, and $\hat{v}_t$ is the corresponding codeword for $0 \le t \le N$.

For the single-binary recursive systematic convolutional (RSC) encoder input signal $u_t$ after BPSK mapping, the log-likelihood ratio (LLR) is defined as

$$L(u_t) \triangleq \ln \frac{\Pr\{u_t = +1 \mid \mathbf{r}\}}{\Pr\{u_t = -1 \mid \mathbf{r}\}}. \tag{2.10}$$

Therefore, the LLR can be further decomposed into

$$
\begin{aligned}
L(u_t)
&= \ln \frac{\sum_{(m',m) \in B_t^{+1}} \Pr\{S_{m'}^{(t)}, S_m^{(t+1)} \mid \mathbf{r}\}}
            {\sum_{(m',m) \in B_t^{-1}} \Pr\{S_{m'}^{(t)}, S_m^{(t+1)} \mid \mathbf{r}\}} \\
&= \ln \frac{\sum_{(m',m) \in B_t^{+1}} \Pr\{S_{m'}^{(t)}, S_m^{(t+1)}, \mathbf{r}\}}
            {\sum_{(m',m) \in B_t^{-1}} \Pr\{S_{m'}^{(t)}, S_m^{(t+1)}, \mathbf{r}\}} \\
&= \ln \frac{\sum_{(m',m) \in B_t^{+1}} \alpha(S_{m'}^{(t)})\,\gamma(S_{m'}^{(t)}, S_m^{(t+1)})\,\beta(S_m^{(t+1)})}
            {\sum_{(m',m) \in B_t^{-1}} \alpha(S_{m'}^{(t)})\,\gamma(S_{m'}^{(t)}, S_m^{(t+1)})\,\beta(S_m^{(t+1)})},
\end{aligned}
\tag{2.11}
$$

where $B_t^{+1}$ is the set of all $(m', m)$ whose state transitions are caused by $u_t = +1$, and $B_t^{-1}$ is the set of all $(m', m)$ whose state transitions are caused by $u_t = -1$. The decoded output $\hat{u}_t$ is obtained by making a hard decision on the LLR value:

$$
\hat{u}_t =
\begin{cases}
+1 & \text{if } L(u_t) \ge 0, \\
-1 & \text{if } L(u_t) < 0.
\end{cases}
\tag{2.12}
$$
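The recursions (2.6) to (2.8) and the LLR (2.11) can be illustrated with a minimal sketch. The 2-state accumulator trellis, the BPSK mapping (bit 1 to +1), the Gaussian branch likelihood, and all function names below are assumptions made only for this illustration; they are not the code used elsewhere in this thesis. A brute-force enumeration over all zero-terminated input sequences is included to check the forward-backward result on short blocks.

```python
import itertools
import math

def next_state(s, u):
    """Hypothetical 2-state accumulator RSC: next state = s XOR u."""
    return s ^ u

def gamma(r_t, u, s, sigma2):
    """Branch metric P(u) * P(r_t | v) of (2.9), up to a constant factor."""
    parity = next_state(s, u)            # assumed parity bit = next state
    v = (2 * u - 1, 2 * parity - 1)      # BPSK: bit 1 -> +1, bit 0 -> -1
    dist2 = sum((r - x) ** 2 for r, x in zip(r_t, v))
    return 0.5 * math.exp(-dist2 / (2.0 * sigma2))

def bcjr_llr(r, sigma2):
    """LLR of every input bit via the forward/backward recursions."""
    n = len(r)
    alpha = [[0.0, 0.0] for _ in range(n + 1)]
    beta = [[0.0, 0.0] for _ in range(n + 1)]
    alpha[0][0] = 1.0                    # trellis starts in the zero state
    beta[n][0] = 1.0                     # and ends in the zero state
    for t in range(n):                   # forward recursion (2.6)
        for s in (0, 1):
            for u in (0, 1):
                alpha[t + 1][next_state(s, u)] += alpha[t][s] * gamma(r[t], u, s, sigma2)
    for t in range(n - 1, -1, -1):       # backward recursion (2.7)
        for s in (0, 1):
            for u in (0, 1):
                beta[t][s] += beta[t + 1][next_state(s, u)] * gamma(r[t], u, s, sigma2)
    llr = []
    for t in range(n):                   # LLR (2.11)
        num = sum(alpha[t][s] * gamma(r[t], 1, s, sigma2) * beta[t + 1][next_state(s, 1)]
                  for s in (0, 1))
        den = sum(alpha[t][s] * gamma(r[t], 0, s, sigma2) * beta[t + 1][next_state(s, 0)]
                  for s in (0, 1))
        llr.append(math.log(num / den))
    return llr

def brute_force_llr(r, sigma2):
    """Reference LLR by enumerating every input sequence that ends in state 0."""
    n = len(r)
    num = [0.0] * n
    den = [0.0] * n
    for u_seq in itertools.product((0, 1), repeat=n):
        s, weight = 0, 1.0
        for t, u in enumerate(u_seq):
            weight *= gamma(r[t], u, s, sigma2)
            s = next_state(s, u)
        if s != 0:
            continue
        for t, u in enumerate(u_seq):
            if u:
                num[t] += weight
            else:
                den[t] += weight
    return [math.log(a / b) for a, b in zip(num, den)]
```

The brute-force reference grows exponentially with the block length and is only meant as a sanity check; the forward-backward computation grows linearly.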

2.1.2 The Log-MAP Decoding Algorithm

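From equations (2.6), (2.7), and (2.9), we can see that the MAP algorithm requires complex hardware resources, since it relies on multiplications of probabilities. In order to reduce the hardware complexity, the MAP decoding algorithm can be transformed into the logarithmic domain. First, the branch metric $\gamma$ in (2.9) is transferred to the logarithmic domain:

$$\bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) = \ln \gamma(S_{m'}^{(t)}, S_m^{(t+1)}). \tag{2.13}$$

Then, the forward metric $\alpha$ in (2.6) and the backward metric $\beta$ in (2.7) can be expressed as

$$\bar{\alpha}(S_m^{(t+1)}) = \ln \alpha(S_m^{(t+1)}) = \ln \sum_{S_{m'}^{(t)} \in S} e^{\bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\alpha}(S_{m'}^{(t)})} \tag{2.14}$$

and

$$\bar{\beta}(S_{m'}^{(t)}) = \ln \beta(S_{m'}^{(t)}) = \ln \sum_{S_m^{(t+1)} \in S} e^{\bar{\beta}(S_m^{(t+1)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)})}. \tag{2.15}$$

As the path metrics have been changed, the initial conditions of the metrics become

$$
\begin{aligned}
\bar{\alpha}(S_0^{(0)}) &= 0, & \bar{\alpha}(S_x^{(0)}) &= -\infty \ \text{ for } S_x^{(0)} \in S \setminus S_0, \\
\bar{\beta}(S_0^{(N)}) &= 0, & \bar{\beta}(S_x^{(N)}) &= -\infty \ \text{ for } S_x^{(N)} \in S \setminus S_0.
\end{aligned}
\tag{2.16}
$$

Referring to (2.13), (2.14), and (2.15), the LLR in (2.11) can be rewritten as

$$
\begin{aligned}
L(u_t) ={}& \ln \Bigg( \sum_{(m',m) \in B_t^{+1}} e^{\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})} \Bigg) \\
&- \ln \Bigg( \sum_{(m',m) \in B_t^{-1}} e^{\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})} \Bigg).
\end{aligned}
\tag{2.17}
$$

To simplify the computation in the logarithmic domain, we consider the Jacobian function [3]

$$\ln(e^{x_1} + e^{x_2}) \triangleq \max{}^{*}(x_1, x_2) = \max(x_1, x_2) + \ln\big(1 + e^{-|x_1 - x_2|}\big), \tag{2.18}$$

where the correction term $\ln(1 + e^{-|x_1 - x_2|})$ can be implemented by a lookup table to simplify the hardware design. Applying (2.18) recursively, the Jacobian function can be extended to

$$\ln(e^{x_1} + e^{x_2} + \cdots + e^{x_b}) \triangleq \max{}^{*}(x_1, x_2, \ldots, x_b). \tag{2.19}$$

Applying (2.18) to (2.14) and (2.15) gives

$$\bar{\alpha}(S_m^{(t+1)}) = \operatorname*{max^{*}}_{S_{m'}^{(t)} \in S} \big[\bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\alpha}(S_{m'}^{(t)})\big] \tag{2.21}$$

$$\bar{\beta}(S_{m'}^{(t)}) = \operatorname*{max^{*}}_{S_m^{(t+1)} \in S} \big[\bar{\beta}(S_m^{(t+1)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)})\big], \tag{2.22}$$

and therefore

$$
\begin{aligned}
L(u_t) ={}& \operatorname*{max^{*}}_{(m',m) \in B_t^{+1}} \big[\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})\big] \\
&- \operatorname*{max^{*}}_{(m',m) \in B_t^{-1}} \big[\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})\big].
\end{aligned}
\tag{2.23}
$$

The MAP decoding algorithm based on (2.21), (2.22), and (2.23) is termed the Log-MAP algorithm [4, 5].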
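As a small illustration of the Jacobian function, the sketch below evaluates the two-input max* operation and extends it to several inputs by applying it recursively. The function names `max_star` and `max_star_n` are chosen here for illustration only; a hardware design would typically replace the exact correction term with a lookup table.

```python
import math
from functools import reduce

def max_star(x1, x2):
    """Two-input max*: max(x1, x2) plus the correction term ln(1 + e^{-|x1 - x2|})."""
    return max(x1, x2) + math.log1p(math.exp(-abs(x1 - x2)))

def max_star_n(values):
    """Multi-input max*, obtained by applying the two-input form recursively."""
    return reduce(max_star, values)

if __name__ == "__main__":
    xs = [0.3, -1.2, 2.5, 2.4]
    exact = math.log(sum(math.exp(x) for x in xs))
    print(max_star_n(xs), exact)  # the two values agree up to rounding
```

Because the recursion only ever exponentiates a non-positive number, it also avoids the overflow that a direct evaluation of the sum of exponentials can suffer for large metrics.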

2.1.3 The Max-Log-MAP Decoding Algorithm

The performance of the Log-MAP algorithm is equivalent to that of the MAP algorithm, while the hardware complexity is reduced considerably. However, the correction term $y = \ln(1 + e^{-|x_1 - x_2|})$ in (2.18) still requires a lookup table. Fig. 2.1 shows that $y$ decreases rapidly as $x = |x_1 - x_2|$ increases. To further reduce the complexity, the correction term can be discarded at the cost of some performance degradation caused by the information loss.

Figure 2.1: Correction factor (plot of $y = \ln(1 + e^{-x})$ versus $x$ for $0 \le x \le 10$)

Consequently, we have the following approximations:

$$\max{}^{*}(x_1, x_2) \approx \max(x_1, x_2) \tag{2.24}$$

$$\max{}^{*}(x_1, x_2, \ldots, x_n) \approx \max_{i = 1 \sim n}(x_i). \tag{2.25}$$

Applying (2.24) and (2.25), we can reduce the Log-MAP algorithm to the Max-Log-MAP algorithm, which contains only additions and max functions. Therefore, (2.21), (2.22), and (2.23) can be rewritten as

$$\bar{\alpha}(S_m^{(t+1)}) \approx \max_{S_{m'}^{(t)} \in S} \big[\bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\alpha}(S_{m'}^{(t)})\big] \tag{2.26}$$

$$\bar{\beta}(S_{m'}^{(t)}) \approx \max_{S_m^{(t+1)} \in S} \big[\bar{\beta}(S_m^{(t+1)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)})\big], \tag{2.27}$$

and

$$
\begin{aligned}
L(u_t) \approx{}& \max_{(m',m) \in B_t^{+1}} \big[\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})\big] \\
&- \max_{(m',m) \in B_t^{-1}} \big[\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})\big].
\end{aligned}
\tag{2.28}
$$
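To give a feeling for what is lost when the correction term is dropped, the following sketch tabulates the correction term and compares the exact two-input operation with its max-only approximation. The function names are illustrative assumptions. The gap is largest, $\ln 2$, when the two inputs are equal and becomes negligible once they differ by a few units.

```python
import math

def correction(x):
    """Correction term y = ln(1 + e^{-x}) for x = |x1 - x2| >= 0."""
    return math.log1p(math.exp(-x))

def log_map(x1, x2):
    """Exact two-input operation used by Log-MAP."""
    return max(x1, x2) + correction(abs(x1 - x2))

def max_log_map(x1, x2):
    """Max-only approximation used by Max-Log-MAP."""
    return max(x1, x2)

if __name__ == "__main__":
    for x in (0, 2, 4, 6, 8, 10):
        print(f"x = {x:2d}  correction = {correction(x):.6f}")
```

Since the approximation always underestimates the exact value by at most $\ln 2$ per two-input operation, the resulting metrics are biased in a consistent direction, which is one common motivation for scaling the extrinsic information in Max-Log-MAP decoders.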

2.1.4 Sliding Window Approach

From the previous discussion, the MAP-family decoding algorithms (the MAP, Log-MAP, and Max-Log-MAP algorithms) require both the forward and the backward metrics to calculate the LLR, and all of the metrics must be kept to calculate every $L(u_t)$ for $t = 1 \sim N$. Since the backward recursion starts from the end of the decoding trellis, the LLR values cannot be calculated until the entire block has been received. If the block length is large, this leads to long output latency and requires a large memory in a hardware implementation.

To reduce the memory requirement, the sliding window algorithm [6–8] is applied to avoid storing the metrics of the entire codeword sequence. This algorithm exploits the fact that the backward metrics can be highly reliable even without the true initial condition, provided that the backward recursion is long enough. In Fig. 2.2, the codeword sequence is divided into $\lceil N/T_w \rceil$ sub-blocks with sliding window length $T_w$, which is also called the convergence length, and a dummy backward recursion $\beta_d$ is used to provide the initial condition for the backward recursion $\beta$ of each sub-block.

Figure 2.2: The process of sliding window MAP algorithm (sub-blocks $j-1$ to $j+4$ of window size $T_w$ over time slots $t_1$ to $t_4$, showing the $\beta_d$ recursions and the $L(u_t)$ outputs)

Since the initial condition for the $\beta_d$ recursion is unknown except in the last sub-block, we set equally likely conditions for $\beta_d$ within the $(j+1)$-th sub-block:

$$\beta_d(S_m^{((j+1) \cdot T_w)}) = \frac{1}{M}, \quad \text{for all } S_m^{((j+1) \cdot T_w)} \in S, \tag{2.29}$$

where $M$ is the number of states in the trellis diagram. After the dummy backward recursion $\beta_d$ has run for $T_w$ time instances, the initial metrics $\beta(S_m^{(j \cdot T_w)})$ of the $j$-th sub-block are available for the $\beta$ recursion. During the $(j+1)$-th $\beta_d$ operation, the forward $\alpha$ recursion proceeds concurrently in the $j$-th sub-block, and all of its metric values are stored in memory. In the backward $\beta$ recursion of the $j$-th sub-block, $L(u_t)$ is calculated from the $\alpha$ metrics in memory, the $\beta$ metrics being computed, and the corresponding branch metrics of the $j$-th sub-block. The sliding window length $T_w$ is set to six times the constraint length of the component encoder of the turbo code to ensure reliable initialization of the $\beta$ recursion [8].
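The partitioning described above can be sketched as a simple schedule. The function below is a hypothetical helper written for illustration: it splits a block of length `n` into $\lceil N/T_w \rceil$ sub-blocks (numbered from 1, as in the text) and lists, for each sub-block, the time range over which the dummy recursion $\beta_d$ would run to initialize its $\beta$ recursion. It models only the bookkeeping, not the metric computation.

```python
import math

def sliding_window_schedule(n, tw):
    """Return (j, (start, end), dummy_range) for every sub-block j = 1..ceil(n/tw).

    (start, end) is the half-open time range of sub-block j.
    dummy_range is the range of the (j+1)-th sub-block, over which the dummy
    backward recursion runs; it is None for the last sub-block, whose
    boundary condition is known.
    """
    num_blocks = math.ceil(n / tw)
    schedule = []
    for j in range(1, num_blocks + 1):
        start, end = (j - 1) * tw, min(j * tw, n)
        if j == num_blocks:
            dummy = None
        else:
            dummy = (end, min(end + tw, n))
        schedule.append((j, (start, end), dummy))
    return schedule

if __name__ == "__main__":
    for entry in sliding_window_schedule(100, 32):
        print(entry)
```

With this schedule, only the $\alpha$ metrics of one sub-block of length $T_w$ need to be stored at a time, instead of the metrics of all $N$ trellis stages.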

2.2 Turbo Code

Turbo code, also named parallel concatenated convolutional code (PCCC), convolutional turbo code (CTC), or turbo convolutional code (TCC), was first proposed by C. Berrou, A. Glavieux, and P. Thitimajshima in 1993 [9, 10]. It has been shown that the performance of turbo code can approach the Shannon limit using simple recursive systematic convolutional (RSC) codes concatenated by an interleaver of length $N$. The interleaver permutes the information sequence before the second encoding, introducing code diversity.

2.2.1 Turbo Encoder

The turbo encoder is composed of two RSC encoders and an interleaver that reorders the information sequence. Note that the constituent encoders must be recursive for better performance [11]. In Fig. 2.3, the first encoder encodes the information symbols into the systematic part $v_0(D)$ and the parity part $v_1(D)$; thus, $v_0(D) = u(D)$. The second encoder encodes the interleaved information symbols $\tilde{u}(D)$ into the parity part $v_2(D)$.

Figure 2.3: Turbo encoder (Encoder 1 maps $u(D)$ to $v_0(D)$ and $v_1(D)$; the interleaver produces $\tilde{u}(D)$; Encoder 2 maps $\tilde{u}(D)$ to $v_2(D)$)
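The encoder structure can be sketched in a few lines. The constituent code below (feedback polynomial $1 + D + D^2$, feedforward polynomial $1 + D^2$) and the seeded random interleaver are assumptions chosen only to keep the example small; they are not the polynomials or the interleaver of any particular standard.

```python
import random

def rsc_parity(bits):
    """Parity stream of a hypothetical RSC encoder.

    Assumed generators: feedback 1 + D + D^2, feedforward 1 + D^2.
    """
    s1 = s2 = 0
    parity = []
    for u in bits:
        a = u ^ s1 ^ s2          # recursive feedback bit
        parity.append(a ^ s2)    # feedforward taps at 1 and D^2
        s1, s2 = a, s1           # shift the register
    return parity

def random_interleaver(n, seed=0):
    """Hypothetical interleaver: a fixed pseudo-random permutation of 0..n-1."""
    return random.Random(seed).sample(range(n), n)

def turbo_encode(bits, perm):
    """Return (v0, v1, v2): systematic part and the two parity parts."""
    interleaved = [bits[i] for i in perm]
    v0 = list(bits)                   # systematic part, v0(D) = u(D)
    v1 = rsc_parity(bits)             # parity from encoder 1
    v2 = rsc_parity(interleaved)      # parity from encoder 2
    return v0, v1, v2

if __name__ == "__main__":
    u = [1, 0, 1, 1, 0, 0, 1, 0]
    print(turbo_encode(u, random_interleaver(len(u))))
```

Feeding a single 1 followed by zeros into `rsc_parity` shows why recursion matters: the parity stream does not die out after the register length, so a low-weight input can still produce a high-weight parity sequence.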

2.2.2 Turbo Interleaver

The main reason that turbo code performance comes so close to the Shannon limit is the interleaver. As shown in Fig. 2.3, the interleaver permutes the information sequence $u(D)$ to $\tilde{u}(D)$. The interleaver therefore spreads out burst errors and removes the correlation between the inputs of the two RSC encoders, so that an iterative decoding algorithm based on exchanging uncorrelated information between two decoders can be applied. The interleaver also breaks low-weight codewords, which improves the coding gain.

The code distance spectrum dominates the error-correcting performance of the turbo code. Referring to [12], the effect of the interleaver called spectral thinning reduces the error probability contributed by low-weight codewords. If we assume that the interleaver performs a random permutation, the error probability is reduced by a factor of $1/N$ [11, 13], where $N$ is the interleaver size; $1/N$ is also referred to as the interleaver gain. The size and the permutation considerably affect the turbo code performance. At low SNRs, the interleaver size has the most important effect, whereas the permutation dominates the error performance at high SNRs. Consequently, the interleaver should be structured to break the input patterns that produce low-weight codewords. In that case, the input sequence to the second encoder, which is generated by the interleaver, will most likely produce a high-weight parity sequence and thus increase the weight of the whole turbo codeword.

2.2.3 Turbo Decoder

The iterative turbo decoding process based on the MAP algorithm exchanges soft information among soft-in/soft-out (SISO) decoders to calculate the a posteriori probability of each information bit $u_t$ [2]. For a rate-$1/n$ RSC encoder, each codeword frame consists of one systematic bit and $(n-1)$ parity bits. At the receiver, the received codeword has the systematic symbol $r_{0,t}$ and the parity symbols $r_t^{(1)} \sim r_t^{(n-1)}$. If the a priori information is represented by

$$L_a(u_t) \triangleq \ln \frac{P(u_t = +1)}{P(u_t = -1)}, \tag{2.30}$$

and the channel reliability value $L_c$ is defined as $4E_s/N_0$ for the AWGN channel [14], the branch metric in the logarithmic domain becomes

$$
\bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) = \ln P(u_t)\,P(r_t \mid \hat{v}_t)
= \frac{1}{2}\Big(u_t L_a(u_t) + L_c u_t r_{0,t} + \sum_{i=1}^{n-1} L_c r_t^{(i)} \hat{v}_t^{(i)}\Big),
\tag{2.31}
$$

which follows from (2.9) and (2.13), up to an additive constant that does not depend on the branch and therefore cancels in the LLR.

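As a result, the APP information from the SISO decoder can be derived as follows:

$$
\begin{aligned}
L(u_t)
&= \ln \frac{\sum_{(m',m) \in B_t^{+1}} e^{\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})}}
            {\sum_{(m',m) \in B_t^{-1}} e^{\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})}} \\
&= \ln \frac{\sum_{(m',m) \in B_t^{+1}} e^{\frac{1}{2}((+1)L_a(u_t) + (+1)L_c r_{0,t})}\,
             e^{\bar{\alpha}(S_{m'}^{(t)}) + \frac{1}{2}\sum_{i=1}^{n-1} L_c r_t^{(i)} \hat{v}_t^{(i)} + \bar{\beta}(S_m^{(t+1)})}}
            {\sum_{(m',m) \in B_t^{-1}} e^{\frac{1}{2}((-1)L_a(u_t) + (-1)L_c r_{0,t})}\,
             e^{\bar{\alpha}(S_{m'}^{(t)}) + \frac{1}{2}\sum_{i=1}^{n-1} L_c r_t^{(i)} \hat{v}_t^{(i)} + \bar{\beta}(S_m^{(t+1)})}} \\
&= L_a(u_t) + L_c r_{0,t}
 + \ln \frac{\sum_{(m',m) \in B_t^{+1}} e^{\bar{\alpha}(S_{m'}^{(t)}) + \frac{1}{2}\sum_{i=1}^{n-1} L_c r_t^{(i)} \hat{v}_t^{(i)} + \bar{\beta}(S_m^{(t+1)})}}
            {\sum_{(m',m) \in B_t^{-1}} e^{\bar{\alpha}(S_{m'}^{(t)}) + \frac{1}{2}\sum_{i=1}^{n-1} L_c r_t^{(i)} \hat{v}_t^{(i)} + \bar{\beta}(S_m^{(t+1)})}} \\
&= L_a(u_t) + L_c r_{0,t} + L_e(u_t).
\end{aligned}
\tag{2.32}
$$

The term $L_e(u_t)$ is the extrinsic information corresponding to the information bit $u_t$ [9, 10].

Figure 2.4: Turbo decoder (SISO decoder-1 and SISO decoder-2 connected through two interleavers and a de-interleaver; inputs $r_{0,t}$, $r_{1,t}$, $r_{2,t}$, and $\tilde{r}_{0,t}$; signals $L_{a1}(u_t)$, $L_1(u_t)$, $L_{e1}(u_t)$, $L_{a2}(\tilde{u}_t)$, $L_2(\tilde{u}_t)$, $L_{e2}(\tilde{u}_t)$)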
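The decomposition of the APP value into a priori, systematic, and extrinsic parts can be checked numerically with a short sketch. The two helper functions below are illustrative assumptions that follow (2.31) and (2.32); symbols use the BPSK alphabet $\{+1, -1\}$.

```python
def branch_metric(u, parity_bits, la, lc, r_sys, r_par):
    """Log-domain branch metric of (2.31) for one trellis branch.

    u           : hypothesized input bit, +1 or -1
    parity_bits : hypothesized parity bits, each +1 or -1
    la          : a priori LLR L_a(u_t)
    lc          : channel reliability value L_c
    r_sys       : received systematic symbol r_{0,t}
    r_par       : received parity symbols r_t^{(1)} .. r_t^{(n-1)}
    """
    parity_term = sum(lc * r * v for r, v in zip(r_par, parity_bits))
    return 0.5 * (u * la + lc * u * r_sys + parity_term)

def extrinsic(l_app, la, lc, r_sys):
    """Extrinsic LLR L_e(u_t): APP LLR minus the a priori and systematic terms."""
    return l_app - lc * r_sys - la

if __name__ == "__main__":
    la, lc, r_sys, r_par = 0.7, 2.0, -0.3, [0.9]
    g_plus = branch_metric(+1, [+1], la, lc, r_sys, r_par)
    g_minus = branch_metric(-1, [+1], la, lc, r_sys, r_par)
    # For a fixed parity hypothesis, the two metrics differ by L_a + L_c * r_sys.
    print(g_plus - g_minus, la + lc * r_sys)
```

The difference between the two hypotheses depends only on $L_a(u_t)$ and $L_c r_{0,t}$, which is why these two terms can be pulled out of the sums in (2.32) and subtracted again to obtain the extrinsic value passed to the other decoder.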

In the decoder, we receive the systematic sequence $r_0(D)$ as well as the parity sequences $r_1(D)$ and $r_2(D)$ from encoder 1 and encoder 2. In the decoding flow shown in Fig. 2.4, there are two SISO decoders for the two constituent encoders in Fig. 2.3. Initially, we set the a priori information $L_{a1}(u_t)$ for the first decoder to zero and apply the BCJR algorithm to calculate the a posteriori information $L_1(u_t)$. From (2.32), the extrinsic information $L_{e1}(u_t)$ can be obtained as

$$L_{e1}(u_t) = L_1(u_t) - L_c r_{0,t} - L_{a1}(u_t), \tag{2.33}$$

where $L_{a1}(u_t) = 0$ initially. In SISO decoder-2, the inputs are $\tilde{r}_0(D)$, permuted from the systematic part $r_0(D)$, and the parity sequence $r_2(D)$, while the a priori information $L_{a2}(\tilde{u}_t)$ is the extrinsic output $L_{e1}(u_t)$ of decoder-1 after permutation. Consequently, we can evaluate the a posteriori output $L_2(\tilde{u}_t)$ and the extrinsic information $L_{e2}(\tilde{u}_t)$ corresponding to the second constituent code by

$$L_{e2}(\tilde{u}_t) = L_2(\tilde{u}_t) - L_c \tilde{r}_{0,t} - L_{a2}(\tilde{u}_t). \tag{2.34}$$

As shown in Fig. 2.4, the information $L_{e2}(\tilde{u}_t)$ can be regarded as the a priori information $L_{a1}(u_t)$ for SISO decoder-1 after being reordered by the de-interleaver. The BCJR algorithm proceeds again for the first constituent code based on the information $L_{a1}(u_t)$ from SISO decoder-2. The turbo decoding proceeds iteratively with the extrinsic information passing between the two SISO decoders. When the stopping criterion is reached, which may be the maximum iteration number or a correctly decoded codeword, the APP information $L_2(\tilde{u}_t)$ is passed through the de-interleaver and exported for hard decision. Notice that each SISO decoder in Fig. 2.4 runs once within each decoding iteration.

The BER curve of a turbo code can be divided into three regions [15]. At very low SNRs, the signal is so greatly corrupted by channel noise that the decoder cannot improve the error rate and may even degrade it; this non-convergence region has an almost constant and high error probability. As the SNR increases, a waterfall region is encountered where the error rate drops sharply. As the SNR increases still further, an error floor region is encountered where the curve becomes less steep, limiting the performance gains. This error floor region is primarily a function of the distance properties of the code, which can be expressed by

$$P_b \propto Q\left(\sqrt{2\, d_{free}\, R\, \frac{E_b}{N_0}}\right), \tag{2.35}$$

where $d_{free}$ is the minimum free distance of the code, $R$ is the code rate, and $E_b/N_0$ is the SNR.
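The dependence of the error floor on the free distance can be explored with a small numerical sketch of the right-hand side of (2.35). The function names and the example parameters are assumptions for illustration; since (2.35) is only a proportionality, the returned value is useful for comparing codes, not as an absolute bit error rate.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x), computed from the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def error_floor_term(d_free, rate, ebn0_db):
    """Right-hand side of (2.35): Q(sqrt(2 * d_free * R * Eb/N0)), with Eb/N0 given in dB."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return q_function(math.sqrt(2.0 * d_free * rate * ebn0))

if __name__ == "__main__":
    for d_free in (6, 10, 14):
        print(d_free, error_floor_term(d_free, 0.5, 2.0))
```

A larger free distance or a higher SNR increases the argument of the Q function and therefore lowers the floor, which is consistent with the role of the interleaver in removing low-weight codewords.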

2.3 Double-binary Convolutional Turbo Code Decoding Algorithm

Double-binary convolutional turbo code (CTC) can provide better performance than single-binary turbo code for equivalent complexity [16]. This section introduces the double-binary CTC with the tail-biting technique, which avoids reducing the code rate and increasing the transmission bandwidth. With double-binary CTC, the latency of the decoder is halved [17], and the code has been adopted in standards such as DVB-RCS and WiMAX [1, 18].

2.3.1 Double-binary CTC Encoder

The double-binary CTC encoder is shown in Fig. 2.5. Compared with the conventional turbo code, there are two systematic bits per symbol, so the number of branches connected to each state in the trellis diagram increases from two to four.

Figure 2.5: Double-binary Convolutional Turbo encoder (Encoder 1 and Encoder 2 with an interleaver; inputs $u_0(D)$ and $u_1(D)$, interleaved inputs $\tilde{u}_0(D)$ and $\tilde{u}_1(D)$; outputs $v_{00}(D)$, $v_{01}(D)$, $v_{02}(D)$, $v_{10}(D)$, $v_{11}(D)$, $v_{12}(D)$)

For a conventional turbo encoder, tail bits must be added to force the trellis to finish at the zero state. Trellis termination ensures that the initial state for the next block is the all-zero state, but the tail bits decrease the code rate and degrade the transmission efficiency, and the degradation is larger for shorter blocks. With tail-biting, also called the circulation-state technique, the state of the encoder at the beginning of the encoding process is not necessarily the all-zero state. The fundamental idea behind tail-biting is that the encoder is controlled in such a way that it starts and ends the encoding process in the same state [19].

Circular coding ensures that, at the end of the encoding operation, the encoder returns to its initial state, so that data encoding may be represented by a circular trellis. Assume that there exists a circulation state $S_c$ such that, if the encoder starts from state $S_c$, it comes back to the same state when the encoding process is finished. Determining the circulation state $S_c$ requires a pre-encoding operation. First, the encoder is initialized in the all-zero state, and the data sequence of length $N$ is encoded once, leading to a final state $S_m^{(N)}$. Second, $S_c$ is found from the final state $S_m^{(N)}$ by the following equation [19]:

$$S_c = \left(I + G^N\right)^{-1} \times S_m^{(N)}, \tag{2.36}$$

where $G$ is the generator matrix of the encoder and $I$ is the identity matrix. Finally, the data are encoded starting from the state $S_c$ calculated by (2.36).
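The pre-encoding procedure can be sketched over GF(2) as follows. The 3-bit state recursion $s_{t+1} = G s_t + x_t$, the matrix `G`, and the input mapping inside `step` are assumptions chosen for illustration and should be checked against the encoder of the target standard before use. Instead of inverting $I + G^N$ explicitly, the sketch solves $(I + G^N) S_c = S_m^{(N)}$ by trying all eight states, which is equivalent for such a small state space.

```python
import random

# Hypothetical state-transition matrix of a 3-bit recursive encoder over GF(2).
G = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]

def mat_vec(m, v):
    """Matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, v)) % 2 for row in m]

def mat_mul(a, b):
    """Matrix-matrix product over GF(2)."""
    n = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(n)) % 2
             for j in range(n)] for i in range(n)]

def mat_pow(m, e):
    """e-th power of a square matrix over GF(2) by repeated multiplication."""
    n = len(m)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(e):
        result = mat_mul(result, m)
    return result

def step(state, a, b):
    """One encoder step for the input couple (a, b); the input mapping is assumed."""
    x = [a ^ b, b, b]
    return [g ^ xi for g, xi in zip(mat_vec(G, state), x)]

def final_state(start, symbols):
    """State reached after encoding all input couples from the given start state."""
    state = list(start)
    for a, b in symbols:
        state = step(state, a, b)
    return state

def circulation_state(symbols):
    """Solve (I + G^N) Sc = S_N over GF(2), where S_N is the zero-start final state."""
    n = len(symbols)
    s_n = final_state([0, 0, 0], symbols)       # pre-encoding from the zero state
    g_n = mat_pow(G, n)
    i_plus_gn = [[g_n[i][j] ^ int(i == j) for j in range(3)] for i in range(3)]
    candidates = [[(k >> 2) & 1, (k >> 1) & 1, k & 1] for k in range(8)]
    solutions = [c for c in candidates if mat_vec(i_plus_gn, c) == s_n]
    if len(solutions) != 1:
        raise ValueError("I + G^N is singular for this block length")
    return solutions[0]

if __name__ == "__main__":
    rng = random.Random(7)
    data = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(24)]
    sc = circulation_state(data)
    print(sc, final_state(sc, data))  # both states are identical
```

The construction requires $I + G^N$ to be invertible, so the block length must not be a multiple of the period of $G$ (7 for the matrix assumed here). In hardware, the mapping from the pre-encoding final state to $S_c$ is usually stored as a small lookup table rather than computed.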


2.3.2 Decoding Procedure for Double-binary CTC

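According to the iterative decoding algorithm of turbo codes in Section 2.2, the goal of the MAP decoding algorithm is to obtain the extrinsic and LLR values. Therefore, for the input signals $u_{0,t}$ and $u_{1,t}$, the LLR for $i = 1, 2, 3$ can be represented as

$$L^i(d_t) \triangleq \ln \frac{\Pr\{d_t = i \mid \mathbf{r}\}}{\Pr\{d_t = 0 \mid \mathbf{r}\}}, \tag{2.37}$$

where $d_t \in GF(2^2)$ is the input symbol $(u_{0,t}, u_{1,t})$ with elements $\{0, 1, 2, 3\}$ from time $(t-1)$ to time $t$ (that is, $d_t = 00, 01, 10, 11$; decimal notation is used instead of binary for simplicity), and $\mathbf{r}$ is the received symbol sequence after QPSK mapping. The above equation can be decomposed into

$$
\begin{aligned}
L^i(d_t)
&= \ln \frac{\sum_{(m',m) \in B_t^{i}} \Pr\{S_{m'}^{(t)}, S_m^{(t+1)} \mid \mathbf{r}\}}
            {\sum_{(m',m) \in B_t^{0}} \Pr\{S_{m'}^{(t)}, S_m^{(t+1)} \mid \mathbf{r}\}} \\
&= \ln \frac{\sum_{(m',m) \in B_t^{i}} \Pr\{S_{m'}^{(t)}, S_m^{(t+1)}, \mathbf{r}\}}
            {\sum_{(m',m) \in B_t^{0}} \Pr\{S_{m'}^{(t)}, S_m^{(t+1)}, \mathbf{r}\}} \\
&= \ln \frac{\sum_{(m',m) \in B_t^{i}} \alpha(S_{m'}^{(t)})\,\gamma(S_{m'}^{(t)}, S_m^{(t+1)})\,\beta(S_m^{(t+1)})}
            {\sum_{(m',m) \in B_t^{0}} \alpha(S_{m'}^{(t)})\,\gamma(S_{m'}^{(t)}, S_m^{(t+1)})\,\beta(S_m^{(t+1)})},
\end{aligned}
\tag{2.38}
$$

where $B_t^{i}$ is the set of all $(m', m)$ whose state transitions are caused by $d_t = i$, and $B_t^{0}$ is the set of all $(m', m)$ whose state transitions are caused by $d_t = 0$.

Applying the Log-MAP algorithm to (2.38), the LLR can be rewritten as

$$
\begin{aligned}
L^i(d_t)
&= \ln \frac{\sum_{(m',m) \in B_t^{i}} e^{\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})}}
            {\sum_{(m',m) \in B_t^{0}} e^{\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})}} \\
&= \operatorname*{max^{*}}_{(m',m) \in B_t^{i}} \big[\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})\big] \\
&\quad - \operatorname*{max^{*}}_{(m',m) \in B_t^{0}} \big[\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})\big],
\end{aligned}
\tag{2.39}
$$

and the Max-Log-MAP approximation becomes

$$
\begin{aligned}
L^i(d_t) \approx{}& \max_{(m',m) \in B_t^{i}} \big[\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})\big] \\
&- \max_{(m',m) \in B_t^{0}} \big[\bar{\alpha}(S_{m'}^{(t)}) + \bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) + \bar{\beta}(S_m^{(t+1)})\big].
\end{aligned}
\tag{2.40}
$$

Since tail-biting is applied on a circular trellis diagram, all starting and ending states are treated as equally likely. Thus, the initial conditions of the path metrics become

$$
\begin{aligned}
\bar{\alpha}(S_x^{(0)}) &= 0 \quad \text{for } \forall\, S_x^{(0)} \in S, \\
\bar{\beta}(S_x^{(N)}) &= 0 \quad \text{for } \forall\, S_x^{(N)} \in S.
\end{aligned}
\tag{2.41}
$$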

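Figure 2.6: Double-binary CTC decoder (SISO decoder-1 and SISO decoder-2 connected through two interleavers and a de-interleaver; received inputs $r_{0,t}$ to $r_{5,t}$ and the interleaved systematic inputs $\tilde{r}_{0,t}$ and $\tilde{r}_{1,t}$; signals $L_{a1}^{i}(d_t)$, $L^{i}(d_t)$, $L_{e1}^{i}(d_t)$, $L_{a2}^{i}(\tilde{d}_t)$, $L^{i}(\tilde{d}_t)$, $L_{e2}^{i}(\tilde{d}_t)$)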

For a rate-$1/n$ double-binary RSC encoder, each codeword frame consists of two systematic bits and $2(n-1)$ parity bits. At the receiver, the received codeword has the systematic symbols $r_t^{(0)}, r_t^{(1)}$ and the parity symbols $r_t^{(2)} \sim r_t^{(2n-1)}$. Moreover, in order to reduce the computational complexity, increase the throughput, or reduce the power consumption, the branch metric can be further simplified to

$$\bar{\gamma}(S_{m'}^{(t)}, S_m^{(t+1)}) = L_a^i(d_t) + \sum_{j=0}^{2n-1} b_j \cdot r_t^{(j)}, \tag{2.42}$$

where the value of $b_j \in \{+1, -1\}$ depends on the encoding polynomial after BPSK mapping and can be pre-calculated for all state transitions. The a priori information in (2.42) is represented by

$$L_a^i(d_t) \triangleq \ln \frac{P(d_t = i)}{P(d_t = 0)}. \tag{2.43}$$

From the decoding flow shown in Fig. 2.6, the extrinsic information for the next stage can be calculated as

$$L_e^i(d_t) = L^i(d_t) - \big[(b_0 \cdot r_{0,t} + b_1 \cdot r_{1,t}) - (r_{0,t} + r_{1,t})\big] - L_a^i(d_t), \tag{2.44}$$

where $r_{0,t}$ and $r_{1,t}$ are the two systematic symbols, i.e. the $j = 0$ and $j = 1$ terms of (2.42).

The symbol probabilities for the next decoder are computed from the extrinsic output of the previous decoder as

$$L_a^i(d_t) = L_e^i(\tilde{d}_t) = \ln \frac{P(d_t = i)}{P(d_t = 0)}. \tag{2.45}$$

To save hardware resources, we can define $\ln P(d_t = 0)$ to be 0. Hence, the a priori information can be rewritten as follows:

$$
\begin{cases}
\ln P(d_t = 1) = L_e^1(\tilde{d}_t) \\
\ln P(d_t = 2) = L_e^2(\tilde{d}_t) \\
\ln P(d_t = 3) = L_e^3(\tilde{d}_t)
\end{cases}
\tag{2.46}
$$

Assuming that the information symbols are equally probable, we initialize the a priori information for the first iteration as

$$
\begin{cases}
\ln P(d_t = 0) = 0 \\
\ln P(d_t = 1) = 0 \\
\ln P(d_t = 2) = 0 \\
\ln P(d_t = 3) = 0
\end{cases}
\tag{2.47}
$$

The double-binary turbo decoding proceeds iteratively with the extrinsic information passing between the two SISO decoders. When the stopping criterion is reached, which may be the maximum iteration number or a correctly decoded codeword, the final decisions are made according to

$$
\tilde{d}_t =
\begin{cases}
01, & \text{if } L(\tilde{d}_t) = L^1(\tilde{d}_t) > 0 \\
10, & \text{if } L(\tilde{d}_t) = L^2(\tilde{d}_t) > 0 \\
11, & \text{if } L(\tilde{d}_t) = L^3(\tilde{d}_t) > 0 \\
00, & \text{else}
\end{cases}
\tag{2.48}
$$

where

$$L(\tilde{d}_t) = \max\big(L^1(\tilde{d}_t), L^2(\tilde{d}_t), L^3(\tilde{d}_t)\big). \tag{2.49}$$
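The decision rule in (2.48) and (2.49) can be written as a short function. The name `decide_symbol` and the integer encoding of the result (0 to 3, standing for 00, 01, 10, 11) are choices made for this illustration.

```python
def decide_symbol(l1, l2, l3):
    """Hard decision on a double-binary symbol from its three LLRs.

    l1, l2, l3 are the LLRs of symbols 1, 2, 3 relative to symbol 0.
    Returns the decided symbol as an integer 0..3.
    """
    llrs = {1: l1, 2: l2, 3: l3}
    best = max(llrs, key=llrs.get)       # index of the largest LLR, as in (2.49)
    return best if llrs[best] > 0 else 0  # symbol 0 wins when no LLR is positive

if __name__ == "__main__":
    for llr in [(-1.0, -2.0, -0.5), (0.5, 2.0, -1.0), (3.0, 1.0, 2.0)]:
        print(llr, format(decide_symbol(*llr), "02b"))
```

Because every LLR is measured against symbol 0, a non-positive maximum means that symbol 0 is at least as likely as any other symbol, which is why it acts as the default decision.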

2.4 Stochastic Iterative Decoding Algorithm

Stochastic arithmetic was first introduced in the 1960s as a method to design low-precision digital circuits [20]. Because hardware implementations of iterative decoders for error-control codes have become more complex, much research effort has been invested in reducing hardware complexity. The major motivation for considering stochastic computation is the possibility of performing complex computations using only simple logic circuits. In stochastic computation, probabilities are represented as streams of random digital bits using Bernoulli sequences. With this representation, complex operations on probabilities such as multiplication and division can be converted to operations on bits, which can easily be implemented using simple stochastic gates, at the cost of a trade-off between computation accuracy and computation time.

2.4.1 Stochastic Computation

In a stochastic computation, values are encoded as a Bernoulli sequence of bits. For an unsigned number $N$, the probability that any bit $d_i$ in the Bernoulli sequence is a binary 1 is

$$P(d_i = 1) = \frac{N}{N_{\max}}, \tag{2.50}$$

and for a probability value $P_{in}$, the probability of the i-th bit $d_i$ being a binary 1 is

$$P(d_i = 1) = P_{in}. \tag{2.51}$$

From the above equations, we can see that the information is contained in the statistics of the bit stream, and there is no fixed mapping between a probability value and its encoded sequence. The precision is determined by the length of the stochastic sequence, so it can be increased by increasing the sequence length.

Consider the stochastic multiplier in Fig. 2.7. Let $P_a = \Pr(a_i = 1)$ and $P_b = \Pr(b_i = 1)$ be the input probabilities, and let $P_c$ be the output probability. The multiplication of two stochastic sequences can be performed with a single two-input AND gate [21].

[Figure: two-input AND gate with input $P_a = 0.5$ (...0110001011...), input $P_b = 0.4$ (...1000101001...), and output $P_c = 0.2$ (...0000001001...)]

Figure 2.7: Multiplication of two stochastic sequences
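The example of Fig. 2.7 can be reproduced bit by bit. The sketch below is only an illustration; the helper names `stochastic_multiply` and `stream_probability` are hypothetical.

```python
def stochastic_multiply(a_bits, b_bits):
    """Bitwise AND of two stochastic streams (the two-input AND gate of Fig. 2.7)."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def stream_probability(bits):
    """Fraction of ones in a stream, i.e. the probability it encodes."""
    return sum(bits) / len(bits)


a = [int(ch) for ch in "0110001011"]  # Pa = 0.5
b = [int(ch) for ch in "1000101001"]  # Pb = 0.4
c = stochastic_multiply(a, b)
print("".join(map(str, c)), stream_probability(c))  # 0000001001 0.2
```

The output stream encodes the product $P_a P_b$ only when the two input streams are statistically independent, which is why separate random sources are normally used for different inputs.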

The JK flip-flop shown in Fig. 2.8 can be used to perform stochastic division. The probabilities of a random output transition from 0 to 1 and from 1 to 0 are $(1 - P_c)P_a$ and $P_c P_b$, respectively. Since the expected numbers of transitions in both directions must be equal, we have

$$(1 - P_c) P_a = P_c P_b \;\Longrightarrow\; P_c = \frac{P_a}{P_a + P_b}. \tag{2.52}$$

[Figure: JK flip-flop with J = {a_i} (probability $P_a$), K = {b_i} (probability $P_b$), clock input clk, and output Q = {c_i}. Truth table (J, K, Q): (1, 0, 1); (0, 1, 0); (0, 0, Hold); (1, 1, Reverse)]

Figure 2.8: Division of two stochastic sequences
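A small simulation can be used to check the steady-state relation $P_c = P_a/(P_a + P_b)$ of the JK flip-flop. This is a sketch under stated assumptions: the helper names, the stream length, and the seed are arbitrary choices made for illustration, and Python's `random` module stands in for a hardware random source.

```python
import random


def bernoulli_stream(p, length, rng):
    """Return `length` independent bits, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def jk_divide(j_bits, k_bits, q_init=0):
    """Clock a JK flip-flop with stream J = {a_i} and stream K = {b_i}."""
    q, out = q_init, []
    for j, k in zip(j_bits, k_bits):
        if j and k:
            q ^= 1   # J = K = 1: reverse (toggle)
        elif j:
            q = 1    # J = 1, K = 0: set
        elif k:
            q = 0    # J = 0, K = 1: reset
        # J = K = 0: hold
        out.append(q)
    return out


rng = random.Random(2008)  # fixed seed, for repeatability only
a = bernoulli_stream(0.3, 200_000, rng)
b = bernoulli_stream(0.5, 200_000, rng)
c = jk_divide(a, b)
estimate = sum(c) / len(c)
print(f"estimated Pc = {estimate:.3f}, Pa/(Pa+Pb) = {0.3 / (0.3 + 0.5):.3f}")
```

The estimate approaches the ideal value only as the stream grows, which reflects the accuracy versus computation time trade-off of stochastic computation.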

For stochastic addition and subtraction, the operations must be combined with a scaling operation so that the outcome remains in the probability interval [0, 1] [21]. Addition with scaling is performed as

$$P_c = \sum_{i=1}^{N} S_i P_{A_i}, \quad \text{where } \sum_{i=1}^{N} S_i = 1. \tag{2.53}$$

The outcome is the scaled sum of the input probabilities. For $S_i = 1/N$, this operation can be implemented in hardware using a multiplexer as shown in Fig. 2.9, where RS refers to the random selection supplied by (pseudo) random number generators. Generating RS is straightforward when N is a power of two.

[Figure 2.9: multiplexer with data inputs $P_{A_1}, P_{A_2}, \dots, P_{A_N}$, random selection input RS, and output $P_c$]
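The multiplexer-based scaled addition with $S_i = 1/N$ can be sketched as follows. The function names, stream length, and seed are illustrative assumptions; `rng.randrange(n)` plays the role of the random selection RS.

```python
import random


def bernoulli_stream(p, length, rng):
    """Return `length` independent bits, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def mux_add(streams, rng):
    """Scaled addition (2.53) with S_i = 1/N: in every cycle the random
    selection RS forwards the current bit of one uniformly chosen input."""
    n = len(streams)
    length = len(streams[0])
    return [streams[rng.randrange(n)][t] for t in range(length)]


rng = random.Random(7)  # fixed seed, for repeatability only
inputs = [bernoulli_stream(p, 100_000, rng) for p in (0.2, 0.4, 0.6, 0.8)]
out = mux_add(inputs, rng)
print(f"estimated Pc = {sum(out) / len(out):.3f} (ideal 0.500)")
```

With N = 4 inputs, RS needs only two random bits per cycle, which matches the remark that generating RS is straightforward when N is a power of two.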

2.4.2 Stochastic Stream Generation

For the implementation of the trellis-based stochastic decoder, the channel value $y_i$ received from the AWGN channel should first be converted to an LLR:

$$L_i = \ln \frac{P(y_i = 1)}{P(y_i = 0)} = \ln\left(e^{-\frac{1}{2\sigma^2}\left((y_i + 1)^2 - (y_i - 1)^2\right)}\right) = \frac{-1}{2\sigma^2}(4 y_i) = \frac{-1}{N_0}(4 y_i). \tag{2.54}$$

The further conversion from the LLR value in (2.54) to a probability is

$$P_i = \frac{P(y_i = 1)}{P(y_i = 1) + P(y_i = 0)} = \frac{1}{1 + \frac{P(y_i = 0)}{P(y_i = 1)}} = \frac{1}{1 + e^{-L_i}}. \tag{2.55}$$

Assume we use an N-bit representation for the received probabilities. These probabilities are converted to stochastic streams by using the structure shown in Fig. 2.10. This structure consists of a comparator which compares the channel probability, P, with a (pseudo) random number, R. The channel probability P is fixed during the decoding process, but R is a random number (with a uniform distribution) which is updated in every decoding cycle. The output bit of the comparator is equal to 1 if P > R; otherwise it is equal to 0. Since R has a uniform distribution and can take a value from 0 to $2^N - 1$, each bit in the output stochastic stream is equal to 1 with a probability of $P/2^N$ [22].

[Figure 2.10: comparator with the N-bit input probability P and the N-bit random stream R as inputs; its 1-bit output (P > R) is the stochastic stream]
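The chain from channel value to stochastic stream, (2.54), (2.55), and the comparator of Fig. 2.10, can be sketched in Python. The function names, the 10-bit width used in the example, and the truncating quantization are assumptions for illustration, not a statement of how the thesis hardware quantizes P.

```python
import math
import random


def channel_llr(y, n0):
    """LLR of a received BPSK symbol over AWGN, following (2.54)."""
    return -4.0 * y / n0


def llr_to_probability(llr):
    """Probability that the bit is 1, following (2.55)."""
    return 1.0 / (1.0 + math.exp(-llr))


def stochastic_stream(p, n_bits, length, rng):
    """Comparator of Fig. 2.10: output 1 whenever the N-bit integer P exceeds
    a uniform N-bit random number R that is redrawn every cycle."""
    levels = 1 << n_bits
    p_int = min(int(p * levels), levels - 1)  # assumed truncating quantization
    return [1 if p_int > rng.randrange(levels) else 0 for _ in range(length)]


rng = random.Random(33)  # fixed seed, for repeatability only
p = llr_to_probability(channel_llr(-1.0, 2.0))  # y = -1 corresponds to bit 1
bits = stochastic_stream(p, 10, 100_000, rng)
print(f"P = {p:.4f}, stream estimate = {sum(bits) / len(bits):.4f}")
```

Since R is uniform over $2^N$ values, the stream encodes the quantized value of P, so the stream statistics are limited by both the bit width N and the stream length.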

2.4.3 Trellis-based Stochastic Decoding Algorithm

The stochastic decoding algorithm is a message-passing algorithm based on the code constraint graph. To implement the stochastic message-passing algorithm, we use the following deterministic message update rule at each function node [23]. Consider the propagation of messages from $(A_i(T), B_i(T))$ to $C_{i+1}(T+1)$ with $i = 1 \sim N$, where $A_i(T)$ and $B_i(T)$ are received messages, $C_{i+1}(T+1)$ is the transmitted message, and N is the block length. Assume that at time instance T, $A_i(T) = a$ and $B_i(T) = b$. Then

$$C_{i+1}(T+1) = \begin{cases} f_C(a, b) & \text{if } (a, b) \in S \\ C_{i+1}(T) & \text{otherwise} \end{cases} \tag{2.56}$$

It is sometimes convenient to refer to the set S as the satisfaction of the constraint function $f_C$. For each row (a, b, c) in the satisfaction table in Fig. 2.11, there is a branch in the trellis which connects a with c and which is labeled b. This relationship for a (2, 1, 3) convolutional code is illustrated in Fig. 2.11.

[Figure: (a) constraint node with inputs A and B and output C, with its trellis description; (b) satisfaction table S with the following rows]

a  b  c
0  0  0
0  3  4
1  0  4
1  3  0
2  1  5
2  2  1
3  1  1
3  2  5
4  0  6
4  3  2
5  0  2
5  3  6
6  1  3
6  2  7
7  1  7
7  2  3

Figure 2.11: An example of a constraint node (a), showing a detailed trellis description of its constraint; and (b) the set S corresponding to this constraint
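The update rule (2.56) amounts to a table lookup with a hold condition. The sketch below transcribes the (a, b, c) rows of the satisfaction table in Fig. 2.11(b) into a Python dictionary; the names `S` and `update` are illustrative.

```python
# (a, b) -> c, transcribed from the satisfaction table of Fig. 2.11(b).
# a: source state, b: branch codeword, c: destination state (all in decimal).
S = {
    (0, 0): 0, (0, 3): 4,
    (1, 0): 4, (1, 3): 0,
    (2, 1): 5, (2, 2): 1,
    (3, 1): 1, (3, 2): 5,
    (4, 0): 6, (4, 3): 2,
    (5, 0): 2, (5, 3): 6,
    (6, 1): 3, (6, 2): 7,
    (7, 1): 7, (7, 2): 3,
}


def update(a, b, c_prev):
    """Update rule (2.56): follow the trellis branch if (a, b) is in S,
    otherwise keep the destination state of the previous decoding cycle."""
    return S.get((a, b), c_prev)


print(update(0, 3, 5))  # 4: codeword 11 from state 000 leads to state 100
print(update(0, 1, 5))  # 5: codeword 01 is not a branch of state 000, so hold
```

Each source state has exactly two valid branch codewords out of four, so with random inputs a hold occurs frequently.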

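The state transition parameters in (2.56) are defined as follows:

• a : Source state

• b : Branch codeword

• fC(a, b) : Destination state

[Figure: source state $A_i(T)$ = 000 with two branches labeled by $B_i(T)$: codeword 00 leads to $C_{i+1}(T+1)$ = 000 and codeword 11 leads to $C_{i+1}(T+1)$ = 100]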

Figure 2.12: Update rule for stochastic decoding algorithm

A simple trellis diagram for a (2, 1, 3) convolutional code is shown in Fig. 2.12, and the corresponding value sets of the state transition are defined as follows:

$$\begin{aligned} A &= \{000, 001, 010, \dots, 111\} \\ B &= \{(00), (01), (10), (11)\} \\ C &= \{000, 001, 010, \dots, 111\} \end{aligned} \tag{2.57}$$

Furthermore, assume that at time instance T the branch metrics become

$$B_i(T) = \{B_{i,0}(T), B_{i,1}(T), \dots, B_{i,N-1}(T)\}, \tag{2.58}$$

where

$$\begin{aligned} i &= 0 \sim (\text{Block Length} - 1) \\ T &= 0 \sim (\text{Decoding Cycle} - 1) \end{aligned} \tag{2.59}$$

Referring to (2.56), there are three possible values for the destination state in Fig. 2.12:

$$C_{i+1}(T+1) = \begin{cases} f_C(000, (00)) = 000 \\ f_C(000, (11)) = 100 \\ C_{i+1}(T) \end{cases} \tag{2.60}$$

That is, if the received codeword $(B_{i,0}, B_{i,1})$ (the time instance T is omitted from the notation for simplicity) is 00, the destination state $C_{i+1}$ is 0; if the codeword is 11, the destination state is 4; otherwise the destination state remains the same as in the last decoding cycle. By using the update rule, the trellis-based stochastic decoder with a matched codeword can be implemented as in Fig. 2.13(a). The branch metric $(B_{i,0}, B_{i,1})$ is a stochastic stream generated by a comparator from the input probability and the random stream. With the matched codeword 11, the state is updated from the initial state to state 4, as given by (2.60). At the same time, we increase the counter because the transmitted bit of the matched branch is "1"; if the transmitted bit is "0", we decrease the counter. The main function of this counter is to make the final decision that converts the stochastic stream to a digital bit.

Furthermore, if there is no matched codeword, as shown in Fig. 2.13(b), the destination state remains the same as in the previous decoding cycle. Under this condition, the counter keeps its value. When the maximum decoding cycle is reached, the counter is exported for the hard decision.

[Figure: (a) matched codeword, assuming $B_{i,0}B_{i,1} = 11$: two comparators (CMP) compare the input probabilities with random streams to produce the stochastic stream, and the trellis moves from the initial state to the update state; (b) mis-matched codeword, assuming $B_{i,0}B_{i,1} \neq 00$ or 11: the trellis remains in the same state]

Figure 2.13: Trellis-based stochastic decoder. (a) With matched codeword. (b) With mis-matched codeword
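The behavior of one trellis stage with its decision counter can be sketched as follows. This is a simplified illustration under several assumptions: the source state is held fixed at 000 (in the decoder it is itself a changing message), only the two branches of (2.60) are tabulated, the transmitted bit is taken as the most significant bit of the destination state (inferred from the rows of Fig. 2.11(b)), and a counter value of zero is decided as 0. The name `run_stage` is hypothetical.

```python
import random

# Branches leaving source state 000, from (2.60): codeword -> destination state.
BRANCHES_FROM_000 = {0b00: 0b000, 0b11: 0b100}


def run_stage(p_b0, p_b1, cycles, rng, c_init=0):
    """Run one stage for `cycles` decoding cycles and return
    (final destination state, counter, hard decision)."""
    c, counter = c_init, 0
    for _ in range(cycles):
        b0 = 1 if p_b0 > rng.random() else 0  # comparator on B_{i,0}
        b1 = 1 if p_b1 > rng.random() else 0  # comparator on B_{i,1}
        codeword = (b0 << 1) | b1
        if codeword in BRANCHES_FROM_000:     # matched codeword: update state
            c = BRANCHES_FROM_000[codeword]
            counter += 1 if (c >> 2) else -1  # assumed: info bit = MSB of c
        # mis-matched codeword: state and counter keep their values
    return c, counter, 1 if counter > 0 else 0


rng = random.Random(13)  # fixed seed, for repeatability only
print(run_stage(0.9, 0.9, 2000, rng))  # both code bits likely 1
print(run_stage(0.1, 0.1, 2000, rng))  # both code bits likely 0
```

When the two probabilities point to a codeword that is not a branch of the current state, the stage holds its state for most cycles, which is the starting point of the latching problem discussed in Chapter 3.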


Chapter 3 Trellis-based Stochastic Decoder

3.1 Analysis of Stochastic Update Rule

As described in Section 2.4, the error-correcting performance of the trellis-based stochastic decoder can be adjusted by the number of decoding cycles. Fig. 3.1 compares the performance of the uncoded sequence, the hard-decision Viterbi decoding algorithm, the soft-decision Viterbi decoding algorithm, and the trellis-based stochastic decoding algorithm with 2500 decoding cycles. All simulations use a (2, 1, 3) convolutional code with BPSK mapping over an AWGN channel, and the random stream in Fig. 2.13 is generated by the random number function in C++. The stochastic decoder curve is a fixed-point simulation with a quantization width of 10; the other curves are floating-point simulations.

From the simulation result, we can see that the performance of the stochastic decoder is even worse than that of the uncoded sequence. The reason might be that the received log-likelihood ratios (LLRs) become so large that the corresponding probabilities approach "0" (or "1"). In this case, the bits in the stochastic sequences are mostly "0" (or "1"), and random switching events become too rare for proper decoding [24]. In the following sections, we discuss some methods to improve the performance of the trellis-based stochastic decoder.

[Figure: BER (from 10^0 down to 10^-6) versus Eb/No (dB) from 0 to 9. Title: BPSK; AWGN; (2, 1, 3) Convolutional Code with Stochastic Update Rule. Curves: Uncoded; Soft-decision Viterbi Algorithm; Hard-decision Viterbi Algorithm; Stochastic Decoder with Decoding Cycle = 2500]

Figure 3.1: Performance of stochastic decoder

3.2 State Memory Method

One major difficulty observed in the trellis-based stochastic decoding algorithm is the latching problem, which refers to the case where a cycle in the trellis diagram causes the state transition to lock into a fixed state. The main cause of the latching problem is a mis-matched codeword, which makes the trellis diagram remain in the same state.

To keep the state transition from locking into a fixed state during the decoding cycles and to increase the random switching activity of the stochastic sequences in the trellis diagram during latching, we propose a state memory that stores the states reached by matched-codeword transitions. As shown in Fig. 3.2, when the decoding cycle is larger than $T_{init}$, the state memory stores the updated state of a matched codeword and increases the valid counter, which records the memory index. $T_{init}$ is a parameter that reduces the chance of locking into a fixed state.

[Figure: trellis stages for $B_{i-1,0}B_{i-1,1}$ and $B_{i,0}B_{i,1}$ showing the previous cycle and the update state; the update state is written into the state memory, which is organized by block length and cycle and has a valid count]

Figure 3.2: State memory with matched codeword

Fig. 3.3 shows the operation of the state memory when a mis-matched codeword occurs in the trellis diagram. To increase the sensitivity to bit transitions for proper decoding, when the update rule fails (codeword mis-matched), we randomly select a state from the state memory to update the trellis diagram. This updating scheme reduces the chance of locking into a fixed state because every time a mis-matched codeword occurs, the state is randomly chosen from previously updated states that were not produced by latching.

To increase the state transition activity, the state memory is applied to both the forward and the backward transitions in the trellis diagram (just like the forward and backward metrics in the BCJR decoding algorithm). The double-side state memory application is shown in Fig. 3.4: the forward transition and the backward transition update their states simultaneously in every decoding cycle. The decisions of the forward and backward transitions are also calculated in each decoding cycle. When the maximum decoding cycle is reached, the counter is exported for the hard decision.
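The state-memory fallback described above can be sketched for a single trellis stage. Several details are assumptions made for illustration and are not specified in the text: the memory depth, the policy of overwriting the oldest entry once the memory is full, and the fallback to the previous state while the memory is still empty. The names `StateMemory` and `update_with_memory` are hypothetical.

```python
import random


class StateMemory:
    """Per-stage memory of destination states reached by matched codewords."""

    def __init__(self, depth, t_init):
        self.depth = depth      # assumed memory depth
        self.t_init = t_init    # states are stored only after cycle T_init
        self.states = []
        self.valid_count = 0    # number of states stored so far

    def store(self, state, cycle):
        if cycle <= self.t_init:
            return
        if len(self.states) < self.depth:
            self.states.append(state)
        else:
            # assumed policy: overwrite the oldest entry
            self.states[self.valid_count % self.depth] = state
        self.valid_count += 1

    def pick(self, rng, fallback):
        """Random previously stored state, or `fallback` if nothing is stored yet."""
        return rng.choice(self.states) if self.states else fallback


def update_with_memory(a, b, c_prev, table, memory, cycle, rng):
    """Update rule (2.56) with the state-memory fallback on a mis-match."""
    if (a, b) in table:
        c = table[(a, b)]
        memory.store(c, cycle)
        return c
    return memory.pick(rng, c_prev)


table = {(0, 0): 0, (0, 3): 4}           # branches of source state 000
memory = StateMemory(depth=4, t_init=2)
rng = random.Random(5)                    # fixed seed, for repeatability only
print(update_with_memory(0, 3, 7, table, memory, cycle=3, rng=rng))  # 4
print(update_with_memory(0, 1, 7, table, memory, cycle=4, rng=rng))  # 4
```

In the second call the codeword does not match, so instead of holding state 7 the stage re-injects a state that was previously reached through a valid branch.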

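[Figure: trellis stages for $B_{i-1,0}B_{i-1,1}$ and $B_{i,0}B_{i,1}$ showing the previous cycle and the update state; on a mis-matched codeword the update state is taken from the state memory, which is organized by block length and cycle and has a valid count]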

Figure 3.3: State memory with mis-matched codeword

The simulation results and performance comparison are shown in Fig. 3.5; the simulation environment is the same as for Fig. 3.1. As shown, the stochastic decoder with state memory improves on the decoder without state memory and outperforms the hard-decision Viterbi decoding algorithm at low SNRs. At high SNRs, however, the BER curve of the stochastic decoder with state memory has an error floor, and its performance is worse than that of the hard-decision Viterbi decoding algorithm. In the next section, we discuss this condition and introduce a further improvement to solve this problem.

[Figure: forward and backward transitions through the trellis, each moving from its initial state to its update state (assuming $B_{i,0}B_{i,1} = 11$) with stochastic streams produced by comparators (CMP) from the input probability and the random stream; the forward decision and the backward decision are combined and passed to the hard decision that outputs the decoded data]

Figure 3.4: Double-side state memory application

3.3 Noise Dependent Scaling Factor

The noise-dependent scaling (NDS) proposed in [24] down-scales the received channel LLRs by a factor that is proportional to the noise level. The down-scaled LLRs result in probabilities which introduce more switching activity in the stochastic decoder. Because the scaling factor is proportional to the noise level, it ensures a similar level of switching activity for different SNRs. Assuming BPSK transmission over an AWGN channel, the original LLR ($L_i$) is

$$L_i = \frac{-1}{N_0}(4 y_i), \tag{3.1}$$

[Figure: BER (from 10^0 down to 10^-6) versus Eb/No (dB) from 0 to 9. Title: BPSK; AWGN; (2, 1, 3) Convolutional Code with State Memory. Curves: Uncoded; Soft-decision Viterbi Algorithm; Hard-decision Viterbi Algorithm; Original Trellis Decoding with CYCLE=2500; State Memory Usage with CYCLE=5000]

Figure 3.5: Stochastic decoder with state memory usage

where $y_i$ is the received symbol and $N_0$ is the single-sided noise power spectral density. The scaled LLR is calculated as

$$L'_i = \frac{\alpha N_0}{Y} L_i = \frac{\alpha N_0}{Y} \cdot \frac{-1}{N_0}(4 y_i) = \frac{\alpha}{Y}(-4 y_i), \tag{3.2}$$

where Y is the fixed maximum value of the received symbols and α is a constant factor. As a result, α/Y is the noise-dependent scaling factor. Fig. 3.6 shows the performance of different NDS factors with state memory usage. The NDS factor 0.6 is the best case among those compared and shows no error floor at a BER of $10^{-5}$.
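The effect of the scaling in (3.2) on the stream probabilities can be illustrated numerically. The sketch below is only an example: the values of $N_0$, Y, and the received symbol are arbitrary assumptions chosen to show the trend, and the function names are hypothetical.

```python
import math


def channel_llr(y, n0):
    """Unscaled LLR of (3.1)."""
    return -4.0 * y / n0


def nds_scaled_llr(y, alpha, y_max):
    """Scaled LLR of (3.2); note that N0 cancels out."""
    return (alpha / y_max) * (-4.0 * y)


def llr_to_probability(llr):
    """Probability that the bit is 1, following (2.55)."""
    return 1.0 / (1.0 + math.exp(-llr))


y, n0 = -1.0, 0.2            # assumed received symbol and noise density (high SNR)
alpha, y_max = 0.6, 6.0      # alpha as in Fig. 3.6; Y = 6 is an assumed value
p_raw = llr_to_probability(channel_llr(y, n0))
p_nds = llr_to_probability(nds_scaled_llr(y, alpha, y_max))
print(f"unscaled P = {p_raw:.6f}, NDS-scaled P = {p_nds:.6f}")
```

The unscaled probability is so close to 1 that the stream would almost never switch, whereas the scaled probability stays well inside (0, 1), which is the switching activity the NDS is meant to restore.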

[Figure: BER (from 10^0 down to 10^-6) versus Eb/No (dB) from 1 to 7. Title: BPSK; AWGN; (2, 1, 3) Convolutional Code with NDS and State Memory. Curves: Soft-decision Viterbi Algorithm; Hard-decision Viterbi Algorithm; without NDS; NDS = 0.6; NDS = 0.7; NDS = 0.8; NDS = 0.9]

Figure 3.6: NDS factor comparison

3.4 Discussion

Applying the state memory and a noise-dependent scaling factor of 0.6 to the trellis-based stochastic decoder makes an implementation of the stochastic decoding algorithm considerably more feasible. In addition, the number of decoding cycles affects both the error-correcting performance and the throughput of the stochastic decoder; the analysis is shown in Fig. 3.7. Across the simulated SNRs (from 0 to 7), 2500 decoding cycles appear to be sufficient for good error-correcting performance.

[Figure: BER (from 10^0 down to 10^-6) versus decoding cycle from 1500 to 5000. Title: BPSK; AWGN; Decoding Cycle Analysis. Curves: SNR = 0.0 to SNR = 7.0 in steps of 0.5]

Figure 3.7: Decoding cycle analysis of stochastic decoder

Fig. 3.8 shows the 1-stage stochastic decoder architecture; the corresponding codeword and destination state are obtained from LUTB and LUTC, respectively. The critical path, which is shorter than that of a conventional ACS unit, is also labeled in Fig. 3.8. The synthesis result shows that the 1-stage trellis-based stochastic decoder can operate at 1.8 GHz in the UMC 90-nm CMOS process. Although the throughput can be enhanced by using the stochastic decoding algorithm, there are still some disadvantages for implementation. First, there is a performance loss of about 1 dB at a BER of $10^{-5}$ compared with the soft-decision Viterbi algorithm. Second, the number of required functional units is proportional to the block length, which may result in a larger hardware cost of about 187K gates. Because the goal is an area-efficient design for the WiMAX standard, the original decoding algorithm, which can meet the design requirements, is adopted for the hardware implementation. The stochastic decoding algorithm could instead be applied to high-throughput standards such as IEEE 802.16m.

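[Figure 3.8: 1-stage stochastic decoder architecture. Blocks: path metric, LUTB, two comparators (CMP), LUTC, counter, registers (D), MUX, and hard decision producing the decoded data from the stochastic stream. Signals: $A_i(T)$, $B_i(T)$, $C_{i+1}(T)$, $C_{i+1}(T+1)$. The critical path is labeled with $T_{critical}$ = 0.55 ns @ 90 nm]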

Chapter 4 Double-binary CTC Decoder for WiMAX 802.16e Application

4.1 Introduction of WiMAX 802.16e Standard

In WiMAX 802.16e, the channel coding specifies convolutional turbo codes (CTC) as an optional code. It uses double-binary turbo codes to improve the error-correcting performance and decoding throughput. In addition, 802.16e provides 17 modes to support different block sizes. To support the various block sizes, the interleaver is designed as a function of five parameters.

Because the different parameters of the different modes are a challenge for hardware implementation, minimizing the configurable parameter controller architecture is the main concern of this chapter. Moreover, since the block length ranges from 24 to 2400, the memory requirement is also a major issue.

Fig. 4.1 illustrates the turbo encoder block diagram [1]. It consists of two circular recursive encoders, and the code rate of the CTC encoder is 1/3. Each encoder generates two additional parity bits from two information bits. The polynomials defining the connections and symbol notations are described as follows:

• For the feedback branch: $1 + D + D^3$

• For the Y parity bit: $1 + D^2 + D^3$

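[Figure: the information bits A and B enter the constituent encoder through a switch, either directly (position 1) or through the CTC interleaver (position 2); the constituent encoder, with registers S1, S2, S3, produces the systematic part (A, B) and the parity part (Y, W); the parity outputs are C1 = Y1W1 and C2 = Y2W2]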

Figure 4.1: CTC encoder for WiMAX standard

The trellis diagram generated by the circular encoder is shown in Fig. 4.2. Each state receives four branch metrics (information symbol $d_t$ = 00, 01, 10, 11) and also sends four messages to other states. As a result, a radix-4 ACS unit is required to decode the trellis diagram.

The state of the encoder is denoted S (0 ≤ S ≤ 7), where S is the value read in binary (left to right) out of the constituent encoder memory (see Fig. 4.1). The circulation state $S_c$ in (2.36) is determined by the following operations:

Step 1: Initialize the encoder with state 0. Encode the sequence in the natural order for the determination of $S_c$. Assume the final state of the encoder is $S0_{N-1}$.

Step 2: According to the length N of the sequence, use Table 4.1 to find $S_c$.
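Step 2 is a pure table lookup. The sketch below transcribes the rows of Table 4.1 into a Python dictionary; the names `CIRCULATION_STATE` and `circulation_state` are illustrative, and raising an error when N is a multiple of 7 is an assumption, since the table has no row for that case.

```python
# Table 4.1: row index is N mod 7, column index is the final state S0_{N-1}.
CIRCULATION_STATE = {
    1: [0, 6, 4, 2, 7, 1, 3, 5],
    2: [0, 3, 7, 4, 5, 6, 2, 1],
    3: [0, 5, 3, 6, 2, 7, 1, 4],
    4: [0, 4, 1, 5, 6, 2, 7, 3],
    5: [0, 2, 5, 7, 1, 3, 4, 6],
    6: [0, 7, 6, 1, 3, 4, 5, 2],
}


def circulation_state(n, s0_final):
    """Return Sc for block length n, given the final state S0_{N-1} reached
    after encoding the block from the all-zero initial state (Step 1)."""
    row = n % 7
    if row == 0:
        raise ValueError("Table 4.1 has no entry for N mod 7 == 0")
    return CIRCULATION_STATE[row][s0_final]


print(circulation_state(24, 3))  # 6
```

Every row of the table is a permutation of the eight states, so each final state maps to a distinct circulation state.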

In 802.16e, five parameters, namely the block length N, P0, P1, P2, and P3, are specified in Table A.1 and Table A.2, and the parameters for H-ARQ support are specified in Table A.3. The interleaver addresses of the CTC are given in Table 4.2, where j is the index of the memory address for MAP decoder 2 (for decoding interleaved data) and P(j) is the index of the memory address from MAP decoder 1 (for decoding deinterleaved data). The most important operation in this table is the modulo operation. It requires a divider, which occupies a large area and increases the delay of the critical path.

[Figure: radix-4 trellis between states S0 to S7 at time t and states S0 to S7 at time t + 1, with branches for AB = 00, 11, 10, 01; the forward metric $\alpha(S_{m'}^{(t)})$ and the backward metric $\beta(S_m^{(t+1)})$ are marked]

Figure 4.2: Trellis diagram of double-binary CTC

4.2 Simulation Analysis and Parameter Decision

Based on the double-binary CTC decoding algorithm in Section 2.3, the simulation results are discussed in this section. To determine appropriate design parameters such as the bit widths of the path metric, branch metric, and input symbol, performance evaluation through simulation is necessary. In the turbo decoding process, the iteration number and the sliding window size directly influence not only the decoding performance but also the memory requirement of the design. The bit error rate (BER) curves of the floating-point decoders under QPSK modulation and an AWGN channel with a block length of 2400 are presented in Fig. 4.3. At the same iteration number of 5, there is a 0.6 dB loss between sliding window sizes of 5 and 12 at a BER of $10^{-5}$, whereas the performance curves for sliding window sizes of 12 and 20 are almost the same. Comparing different iteration numbers at the same sliding window size of 12, there is a 0.6 dB loss between iteration numbers 5 and 3 at a BER of $10^{-5}$.

Table 4.1: Circulation state lookup table (Sc)

N mod 7    S0_{N−1} = 0   1   2   3   4   5   6   7
1                     0   6   4   2   7   1   3   5
2                     0   3   7   4   5   6   2   1
3                     0   5   3   6   2   7   1   4
4                     0   4   1   5   6   2   7   3
5                     0   2   5   7   1   3   4   6
6                     0   7   6   1   3   4   5   2

Table 4.2: Interleaver Function

for j = 1 to N − 1
    Case (j mod 4)
        Case 0 : P(j) = (P0 × j + 1) mod N
        Case 1 : P(j) = (P0 × j + 1 + N/2 + P1) mod N
        Case 2 : P(j) = (P0 × j + 1 + P2) mod N
        Case 3 : P(j) = (P0 × j + 1 + N/2 + P3) mod N

[Figure: BER (from 10^0 down to 10^-8) versus Eb/No (dB) from 0 to 2. Title: BPSK; AWGN; Window Size and Iteration Analysis. Curves: Window size = 12, Iteration = 5; Window size = 12, Iteration = 3; Window size = 20, Iteration = 5; Window size = 5, Iteration = 5]

Figure 4.3: Comparison of iteration number and window size

Although the Max-Log MAP decoding algorithm introduced in Section 2.1.3 reduces the decoding complexity, it incurs a performance loss due to the approximation of the max function. The approximation usually overestimates the value of the messages. To compensate for the performance loss, we introduce a scaling factor that scales down the extrinsic message. The intrinsic information $L_a^i(d_t)$ can therefore be formulated as follows:

$$L_a^i(d_t) = \beta \times L_e^i(\tilde{d}_t), \tag{4.1}$$

where β is the scaling factor. From Fig. 4.4, with a scaling factor of 0.75 and a block length of 2400, the scaled Max-Log MAP algorithm gains more than 0.3 dB over the unscaled Max-Log MAP algorithm and comes within 0.1 dB of the Log-MAP algorithm, and this step does not cost much hardware area.
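One reason a factor such as 0.75 can be cheap is that it decomposes into shifts and a single addition, since 0.75·x = (x + 2x)/4. The sketch below assumes signed integer (fixed-point) extrinsic values and floor rounding; the thesis does not state how the scaling is realized in the chip, so this is only an illustration, and the name `scale_075` is hypothetical.

```python
def scale_075(le):
    """Multiply a signed fixed-point integer by 0.75 using one addition and
    two shifts: (le + 2*le) >> 2. The result is rounded toward minus infinity."""
    return (le + (le << 1)) >> 2


print(scale_075(8), scale_075(-8))  # 6 -6
```

Python's `>>` on integers is an arithmetic shift, so negative values behave as they would in a two's-complement datapath.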

The fixed-point representation of the internal variables in the MAP decoder is determined from the received symbol quantization. Fig. 4.5 shows the simulation results for different input symbol quantizations under QPSK modulation and an AWGN channel, with a block length of 2400, a sliding window size of 12, and 5 iterations. Note that (a.b) in the figure denotes the quantization scheme, where a is the number of bits used for the integer part and b is the number of bits used for the fractional part. The simulation results show that input symbols of [4.2], intrinsic information of [5.2], and a metric bit width of 10 are the recommended choice for the double-binary CTC decoder, with performance close to the floating-point Max-Log MAP algorithm. The fixed-point representation is summarized in Table 4.3.
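The (a.b) quantization scheme can be illustrated with a saturating quantizer. The sketch assumes a two's-complement format in which the a integer bits include the sign bit, and it uses round-to-nearest; both are assumptions made for illustration, and the name `quantize` is hypothetical.

```python
def quantize(x, int_bits, frac_bits):
    """Quantize x to a signed fixed-point value with `int_bits` integer bits
    (sign bit included, an assumption) and `frac_bits` fractional bits,
    saturating at the ends of the representable range."""
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    q = max(lowest, min(highest, round(x * scale)))
    return q / scale


print(quantize(1.3, 4, 2))    # 1.25
print(quantize(100.0, 4, 2))  # 7.75
```

Under this assumption, the [4.2] input format uses 6 bits with a step of 0.25, and the [5.2] format uses 7 bits.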

Table 4.3: Summary of fixed representation in turbo decoding

Quantities Input symbols Intrinsic information Branch metrics Path metrics

[Figure: BER (from 10^0 down to 10^-6) versus Eb/No (dB) from 0 to 2. Title: BPSK; AWGN; Scaling Factor Analysis. Curves: SF = 0.5; SF = 0.625; SF = 0.75; SF = 0.875; SF = 1.0; Log-MAP]

Figure 4.4: Scaling factor comparison

4.3 Proposed Architecture of WiMAX CTC Decoder

The block diagram of the proposed architecture is illustrated in Fig. 4.6. There are four memory blocks for message storage, which store the input information and the extrinsic information generated by the SISO MAP decoder. The finite-state machine (FSM) controls the iterative decoding procedure and decides which state is in progress. Furthermore, two interleaver units are used to generate the read address (from MEM_EXT to the MAP decoder) and the write address (from the MAP decoder to MEM_EXT). The MEM ADDR control unit generates the memory addresses used to store or access data.

4.3.1 MAP Decoder

Fig. 4.7 shows the architecture of the MAP decoder, which consists of branch metric units (BMUs), add-compare-select (ACS) units, a log-likelihood-ratio (LLR) unit, and buffers. The BMUs compute the branch metrics for ACS-α, ACS-β, and ACS-βd, and each ACS unit performs the add-compare-select operation. ACS-α carries out the forward recursion and saves the results in the α-memory. ACS-β starts the backward recursion from the initial conditions previously determined by ACS-βd. At the same time, the LLR calculator determines $L^i(d_t)$ and $L_e^i(d_t)$. The buffer units reorder the input sequence within one sliding window, and the α buffer is a last-in/first-out (LIFO) buffer used to reorder the α value of each state for the LLR calculation.

[Figure: BER (from 10^0 down to 10^-6) versus Eb/No (dB) from 0 to 2. Title: BPSK; AWGN; Fixed Point Analysis. Curves: Input=3.3, Extrinsic=4.3, Metrics=10; Input=3.3, Extrinsic=4.3, Metrics=11; Input=4.2, Extrinsic=5.2, Metrics=10; Input=4.3, Extrinsic=5.3, Metrics=11; Input=4.1, Extrinsic=5.1, Metrics=10; Max-Log MAP Floating Point]

Figure 4.5: Fixed point comparison

With the sliding window approach in Fig. 2.2, the evaluation of the backward metrics β can start once the required window of data has been stored. However, if we reverse the order of the input sequence within a window, the input buffer of the βd calculation can be saved [25]. Fig. 4.8 shows the timing flow of the MAP decoder. To eliminate the IBUF for βd-ACS, the input order of the MAP decoding runs from the end of the sliding window to the beginning of the window. After the branch metrics, α, and β are obtained, the LLR calculation can be finished without a write-after-read (WAR) data hazard. As a result, the latency of the MAP decoder is three times the sliding window size.

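[Figure: the FSM (C_STATE) and the MEM ADDR control unit drive the memories MEM_AB, MEM_YW1, MEM_YW2, and MEM_EXT through the enable and address signals EN_AB/ADDR_AB, EN_YW1/ADDR_YW1, EN_YW2/ADDR_YW2, and EN_EXT/ADDR_EXT; INTERLEAVER_1 and INTERLEAVER_2 are controlled by EN_INTER1/ADDR_INTER1 and EN_INTER2/ADDR_INTER2; the MAP decoder exchanges data through MAP_IN and MAP_OUT; further signals are IN_FULL and IN_VALID]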

Figure 4.6: CTC decoder block diagram

4.3.2 Pseudo Two-port Register File

To increase the decoding speed and reduce the memory size, a pseudo two-port register file is used to read and write the memory in one cycle. In Fig. 4.9(a), the extrinsic memory operates at double the clock rate. The write address (from interleaver 2) and the read address (from interleaver 1) are generated at the original clock rate. A multiplexer selects the correct address according to whether the operation is a read or a write. Fig. 4.9(b) illustrates the timing diagram of the read and write operations. As a result, we can eliminate one extrinsic memory of 2400 × 21 = 50400 bits (2400 is the block length and 21 is the total number of bits of extrinsic data), so that 26.9% of the memory usage (original: 50400 + 50400 + 86400 = 187200 bits) is saved. In the same way, the buffers in the MAP decoder in Fig. 4.7 are also pseudo two-port register files that read and write in one cycle, as shown in Fig. 4.9(c). The difference between Fig. 4.9(b) and Fig. 4.9(c) is that in Fig. 4.9(c) the read and the write use the same address in one cycle. With this method, the two-port register files of the buffers in the MAP decoder can be replaced by single-port register files to reduce the hardware area and power consumption.
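A behavioral sketch of the pseudo two-port access pattern is given below. It models a single-port memory that is accessed twice per datapath cycle; performing the read in the first fast-clock cycle and the write in the second is an assumption made for illustration, as are the class and method names. The last lines repeat the memory-saving arithmetic of the paragraph above.

```python
class PseudoTwoPortRegisterFile:
    """Single-port memory clocked at twice the datapath rate, so that one
    read and one write fit into a single datapath cycle."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.fast_cycles = 0   # number of 2X-clock accesses performed

    def _access(self, addr, write_enable, data=None):
        """One single-port access; the address multiplexer is implied by `addr`."""
        self.fast_cycles += 1
        if write_enable:
            self.mem[addr] = data
            return None
        return self.mem[addr]

    def cycle(self, read_addr, write_addr, write_data):
        """One datapath cycle: read first, then write (assumed order)."""
        read_data = self._access(read_addr, write_enable=False)
        self._access(write_addr, write_enable=True, data=write_data)
        return read_data


rf = PseudoTwoPortRegisterFile(depth=8)
rf.cycle(read_addr=0, write_addr=3, write_data=11)
print(rf.cycle(read_addr=3, write_addr=3, write_data=12))  # 11

ext_bits = 2400 * 21
total_bits = ext_bits + ext_bits + 86400
print(ext_bits, round(100 * ext_bits / total_bits, 1))  # 50400 26.9
```

The second call reads and writes the same address in one datapath cycle, as in Fig. 4.9(c), and returns the value stored before the write.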

[Figure: MAP decoder composed of buffers, three BMUs, the α ACS, βd ACS, and β ACS units, and the LLR unit]

Figure 4.7: MAP Decoder Block Diagram

[Figure: timing flow over sliding-window slots 0, 1, 2, 3, ..., showing in which slot the BMU, βd-ACS, α-ACS, β-ACS, and LLR units process each window]

Figure 4.8: MAP Decoding Timing Flow

4.3.3 Interleaver Architecture

For the WiMAX interleaver in Table 4.2, the modulo operation is critical for the clock speed. Since all parameters used in Table 4.2 are known before decoding, some additions and divisions (the latter simplified to shifts because the divisor is 2) can be evaluated as constants before decoding. To minimize the critical path of the interleaver operation, two adders and two subtractors are used instead of a divider. One adder accumulates P0, adding one P0 each cycle, because in our design each interleaver only needs to generate one read (write) address per cycle. Because the value of the accumulator ranges from 0 to 2N − 1, the modulo operation can be simplified to one subtraction and one multiplexer.
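The divider-free address generation can be checked against the direct formula of Table 4.2 with a short sketch. The function names are hypothetical, the sketch starts at j = 0 so that all N addresses are produced, and the parameter set N = 24, P0 = 5, P1 = P2 = P3 = 0 is an illustrative choice rather than a value taken from this text; the accumulator scheme also assumes P0 < N.

```python
def interleave_direct(n, p0, p1, p2, p3):
    """Reference addresses computed with the modulo of Table 4.2."""
    offsets = [1, 1 + n // 2 + p1, 1 + p2, 1 + n // 2 + p3]
    return [(p0 * j + offsets[j % 4]) % n for j in range(n)]


def interleave_add_sub(n, p0, p1, p2, p3):
    """Same addresses using two adders and two conditional subtractions."""
    assert 0 <= p0 < n, "the accumulator scheme assumes P0 < N"
    # The four offsets are constants, reduced modulo N once before decoding.
    consts = [1 % n, (1 + n // 2 + p1) % n, (1 + p2) % n, (1 + n // 2 + p3) % n]
    acc = 0                       # holds (P0 * j) mod N
    addresses = []
    for j in range(n):
        s = acc + consts[j % 4]   # adder 2: result lies in 0 .. 2N - 2
        if s >= n:                # subtractor + multiplexer replace "mod N"
            s -= n
        addresses.append(s)
        acc += p0                 # adder 1: accumulate P0 once per cycle
        if acc >= n:              # subtractor + multiplexer
            acc -= n
    return addresses


params = (24, 5, 0, 0, 0)         # illustrative parameter set
direct = interleave_direct(*params)
print(direct[:4])                                # [1, 18, 11, 4]
print(direct == interleave_add_sub(*params))     # True
```

Because both operands of each addition are already smaller than N, a single conditional subtraction is always enough to bring the sum back into the range 0 to N − 1.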

[Figure: (a) single-port register file running at the 2X clock rate with write-address, read-address, write-data, and read-data ports; (b) and (c) timing diagrams showing CLK, the 2X CLK rate, the address sequence, read, write, and WEN]

(a) Memory read & write architecture

(b) Read & write timing diagram for extrinsic memory

(c) Read & write timing diagram for MAP decoder

Figure 4.9: Pseudo Two-port Register File

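[Figure: interleaver datapath in which P0 is accumulated in a register (D) with a subtract-N multiplexer stage; a multiplexer indexed by j mod 4 selects one of the constants 1, (1 + N/2 + P1) mod N, (1 + P2) mod N, and (1 + N/2 + P3) mod N; a second subtract-N multiplexer stage produces the address]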

Figure 4.10: Interleaver Architecture

4.4 Chip Implementation Result

4.4.1 Chip Specification

Based on the architectures described above, we propose an area-efficient double-binary turbo decoder with almost regular permutation (ARP) for WiMAX 802.16e. The proposed CTC decoder supports 17 different block lengths (24 to 2400), including the hybrid automatic repeat-request (HARQ) modes. With a scaling factor of 0.75, we can use a smaller sliding window size (reducing the storage size of the buffer and the α buffer and the number of ACS units) and fewer iterations (increasing the decoding throughput).

The primary chip specification of the double-binary turbo decoder is given in Table 4.4. The decoder is implemented by the cell-based design flow and fabricated in a 90-nm 1P9M standard CMOS process. In the CTC decoding process, two clock domains are used, for the memory and the datapath respectively, as described in Section 4.3.2. The higher clock rate is generated by a delay-locked loop (DLL) circuit, which generates an internal clock whose frequency is four times the external frequency. The other clock is generated by dividing the higher clock rate. The total gate count is 303K (including the additional chip input buffer for testing); the combinational logic part is only 30K, while the memory occupies more than 80% of the area in our design. Fig. 4.11 is the chip layout of the CTC decoder. With the proposed Max-Log MAP decoder and the simplified interleaver, the core size is 1.12 mm² (1.4 mm by 0.8 mm) in the 90-nm process. The operating frequency
