The Max-Log-MAP algorithm

CHAPTER 2 TURBO CODE

2.2 D ECODING A LGORITHMS FOR T URBO C ODE

2.2.3 The Max-Log-MAP algorithm

As we can see, the MAP algorithm involves too many exponentiations and

multiplications. These are quite complex for hardware realization. Thus, an approximation of MAP algorithm termed Max-Log-MAP algorithm [24] was derived for simple implementation of MAP decoders. Instead of calculating e^γ^k, e^α^k, and e^β^k directly, all computations are done in logarithm domain. Here we define γk, αk, and βk as transition metric, forward path metric and backward path metric respectively. γ_k can be formulated as

1 1

respectively. After substituting (2. 17), (2. 18), and (2. 19), in (2. 15) can be re-written as

By utilizing the approximation of

1 2

log(e^δ +e^δ + +e^δⁿ) max( , , , )≈ δ δ δ_n , (2. 27) can be further simplified to

( )ˆ_k

T and backward recursions that repetitively compute the

αk and βk, and can be expressed by

and

Both equations are add-compare-select (ACS) operations, which are similar to the path metric pdating of Viterbi algorithm.

2.2.

and Log-MAP algorithm under different SNR estimation fsets was made in [26].

Otherwise, Log-MAP decoder should be the aspect of coding gain.

4 SNR sensitivity of Max-Log-MAP and Log-MAP algorithm

Referring to (2.13) and its followed deductions, it’s evident that both MAP and log-MAP algorithm requires SNR estimation to obtain the value of channel reliability, i.e. Lc. Unfortunately, accurate estimation cannot be achieved easily. Several papers have discussed the effect of SNR mismatch in turbo decoding. In [25], the simulations show that about -3 to +6 dB SNR estimation offset is tolerable before significant performance degradation.

However, Max-Log-MAP algorithm is able to provide a SNR independent scheme if a priori information is initialized with a reasonable value, such as all zero’s for each state [26]. Due to the linearity of max(.) operations, the term L_c can be canceled out while computing L u( )ˆ_k . The comparison of Max-Log-MAP

Although Log-MAP algorithm provides the performance better than that of Max-Log-MAP algorithm, it suffers the risk of serious SNR mismatch offset. Thus, channel characteristics play an important role in practical implementation. It has been concluded in [26] that if channel characteristics change over time, the Max-Log-MAP decoder is suitable to be the constituent decoder in turbo decoding.

preferable in

2.3 Sliding Window Approach

As what we described in the previous section, the MAP-based algorithm (including MAP algorithm, Max-Log-MAP algorithm, and Log-MAP algorithm) requires both forward and backward path metric to calculate the log-likelihood ratio. Since the forward and backward recursions start from different initial point, the entire block message has to be received and stored for computing forward and backward recursions. Furthermore, we have to store one of the path metrics of forward or backward recursion and wait for another. These restrictions enlarge the memory requirement for hardware implementation of turbo decoder. For example, the maximum block length of 3GPP standard is 5114, which means 5114 codewords and path metrics should be stored. Besides, long output la

state if the backward recursion goes long enough. Fig. 2.5 and Fig. 2.6 shows the process of this approach in both directions and the detail operating flow is described as follows.

tency is also introduced. It limits the speed and throughput of turbo decoder design.

The main problem is that long block length can not be divided into several shot sub-blocks immediately, since the lack of boundary path metric of sub-blocks in opposite direction of input sequences will degrade the performance. Thus, a sliding window approach was proposed in [27] and later on in [28] to overcome this drawback. This approach utilizes the fact that the backward path metrics can be highly reliable even without knowing the initial

i i+1 i+2 i+3

Fig. 2.5 The process diagram of sliding window approach in the forward direction

path metric values for the true backward recursion

First, the received codeword is divided into many sub-blocks, with a sub-block length of W. W is called the convergence length with typically five times the constraint length of the encoder. For each sub-block i, the initial path metric values are inherited from the neighbor sub-blocks for both forward and backward recursion operations. Note that in Fig. 2.5 the dummy backward recursion β1 is employed to obtain the initial

β2. Although the initial condition for β1 is unknown except the last sub-block, we introduce the equal probability condition for β1 values:

( ) x

_t^j

1 , for all j 0,1,..., M

β = M =

(2. 31)

where

x

_t^j denotes the path metric of j-th state at time t, the last Trellis section of β1 , and M is equal to the total state number. During the forward recursion α proceeds in the i-th sub-block and stores these values into memory, the dummy backward recursion β1 is performed in the i+1 sub-block concurrently. As soon as β1 computation is finished, the initial metrics in the i+1 sub-block are available for β2 metrics in computation, and the corresponding branches metrics in the i-th sub-block.

Fig. 2.6 shows the process diagram of sliding window approach in the backward direction. The operation flow is similar to the forward direction type except for two forward recursions α and one backward recursion β.

β

ength code blocks of CCs. The standard solution is to add same bits at the tail of in

Fig. 2.6 The process diagram of sliding window approach in the backward direction

2.4 Tail-Biting Approach

Tail-biting convolutional codes are first developed by G. Solomon and H. C. A. van Tilborg[5] and recognized as equivalent to quasi-cyclic block codes.[6] From the strict definition of convolutional codes (CCs) it is clear that CCs can only be applied to semi-infinite sequences, i.e., encoding starts at time t = 0 in the all-zero state and goes on continuously. But almost any communication system is block-oriented, we must find methods to obtain finite l

formation sequences to force the encoder back to the all-zero state. This method can avoid the weak error protection for the last codeword bits, however it causes same rate loss due to tail bits.

Tail-biting avoids the rate loss without suffering from degraded error protection at the end of the codeword. With tail biting technique, the starting state of encoder is not necessarily

the all-zero state. It can also be any one of the other states. The fundamental idea behind state after encoding the infor

tail-biting is that the starting state should be the same as the ending

mation sequence, i.e., x₀ =x_N. In the Trellis representation of tail-biting codes only those paths that start and end at the state are valid codewords.

2.4.1 Encoding tail-biting codes using feedback encoders

Let us consider a feedforward encoder first. It is obvious that we only have to consider the last m input k0-tuples of information sequences to fulfill the tail-biting boundary conditionx₀ =x_N. But the situation is more complicated for feedback encoders. The last

encoding statex_N depends on the entire information vector u=( , ,u₀ … u_N₋₁). Thus, we must

calculate for a given information vector u the initial statex₀ that will lead to the same state after N cycle. To solve this problem, we consider the state representation:

t t

x

₊

= A x + B u

_t (2. 32) To solve the iterated function by substitution, we can find that the complete solution of (2.32) equals to the superposition of the zero-input solution and the zero-state solution .

If we demand that the state as time t=N is equal to itial statex₀, we obtain from

[ ]zs N

(2.33):

(

)

x = A + I x

(2. 34) Where I_mdenotes the m-by-m identity matrix. If a feedback encoder with certain information length N can provide an invertible matrix(A^N +I_m), the correct initial state x₀ can be calculated by knowing the zero-state responsex^{[ ]}_N^zs .

The encoding process of tail-biting convolutional code shown in Fig. 2.9 is divided into

two steps:

First, the encoder starts from the all-zero state with given information sequences to determine the zero-state response . By knowing the zero-state response, we can calculate the corresponding initial state

[ ]zs

x0 by (2.34). Second, the encoder starts from the correct initial statex₀ and a valid codeword results.

Fig. 2.7 The encoder process of tail-biting convolutional code

Since the matrix has to be invertible, not every code length is legal with a given feedback encoder. Moreover, some feedback encoder can not be tail-biting. Some detail discussion can be found in [7], [8], and[9].

(A^N +I_m)

Chapter 3 The High Speed Turbo Decoder Design I

3.1 Introduction

Presented by Berrou et al. in 1993 [1], turbo codes have been recognized as a milestone in the channel coding theory. Due to their outstanding error-correcting capabilities, turbo codes have been highly appreciated in wireless communications, where signal-to-noise ratios (SNRs) are generally low. Two commonly used soft-input–soft-output (SISO) turbo decoding algorithms are maximum a posteriori probability (MAP) algorithm [2] and soft-output Viterbi algorithm (SOVA) [4]. MAP-based turbo decoders are known to have better performance than SOVA-based turbo decoders while having slightly larger complexity.

Many researches are proposed to improve the speed of turbo decoder. Bickerstaff proposed a high radix decoder [11]; Bougard introduced a full-duplex design [12]; Urard implemented a 5 iterations series turbo decoder [16]. Their works increase the throughput by refining the architectures of the SISO decoders. The highly parallel structure might be a solution to substantial improvement, but there are two difficulties that have to be overcome.

One is the memory contention problem resulted from high-radix and multiple processing elements; the other is the critical path resided in the add-compare-select (ACS) circuit. We proposed a high speed solution that resolves these two problems by using a novel interleaving methods and modifying the MAP decoders. Some interleaving algorithms with contention-free properties have been published [9], and our design adopts the inter-block permutation (IBP) interleaver [13]. Then we exploit a high-radix MAP decoder with shorter

critical path to increase data rate [14]. The proposed turbo decoder provides both high throughput capability and outstanding energy efficiency while maintaining equivalent performance as 3GPP turbo code.

3.2 Decoder Structure

For high speed turbo decoder design, there are generally two types of architectures proposed in the state of the art. Fig 3.1 shows these architectures, the series architecture and the parallel architecture. The series architecture duplicates the same number of processing elements as iterations and each processing element decodes the codeword for only one iteration. After decoding, each processing element will pass the extrinsic value to the next element. This architecture is easy to implement but the hardware cost is very high. The parallel architecture decodes one codeword with multiple decoders. This architecture is more flexible since number of decoders varies from different specifications. The major problem of this architecture is that how to decode a block codeword with multiple decoders. The forward recursion and the backward recursion connect the whole codeword, so we should apply some techniques to separate them. In the following, we will introduce our proposed design using the parallel architecture to solve this problem.

Fig. 3.1 Block diagram of proposed turbo decoder

Fig. 3.2 shows the block diagram of proposed decoder, which consists of 32 parallel MAP decoders and 32 parallel memory sets. We separate a codeword into 32 sub-codewords with length 128. Each sub-codeword is assigned to one decoder and decoded separately.

These sub-codewords are connected by a well-designed inter-block permutation (IBP) interleaver. This method avoids the forward and backward recursion problem while using the parallel architecture. The decoding process is described as follows: first, each memory will collect a 128-bit sub-codeword from input buffer till the whole 4096-bit codeword is received.

The memory stores the received symbols and extrinsic information, which is divided into two banks to support the radix-4 design. Second, the 32 memories will deliver the required data to the 32 MAP decoders through the IBP network, which is part of the interleaver. The interleaver is implemented with the address generators in each memory and the network controller. The MAP decoders perform the primary decoding procedures, and each one is responsible for 128 bits. After 8 iterations, this design would output the decisions of current block and start to decode next block.

Fig. 3.2 Block diagram of proposed turbo decoder

3.3 Interleaver Design for High Speed Turbo Code

3.3.1 Contention-free Interleaver

To increase throughput, a log-MAP decoder is parallelized by dividing a size-N trellis

into M size-W windows (N = MW) and employing M synchronous MAP-based decoders with M separate memory banks. Interleaving latency is eliminated by writing the M values generated each clock cycle directly to their interleaved positions. However, if the interleaver is not designed carefully, two or more MAP-based decoders may require access to the same memory bank on a given clock cycle, resulting in a memory contention. Moreover, a high radix decoding structure also suffers from the memory contention problem while accessing multiple codeword symbols from memories. Fig 3.3 shows an example of memory contention problem in a parallel decoding structure. We store a codeword sequence in order in four different memory banks. It is obvious that it is a contention-free access at all different timing with pre-permutation order. But it will have the memory contention problem if we apply different interleavers. The post-permutation 1 is a contention-free interleaver design. Because every time we access four symbols, they come from different memory banks. The interleaver design of post-permutation 2 suffers two contention collisions at time t0 and t3.

Fig. 3.3 Example of a contention-free permutation

3.3.2 IBP Interleaver

The IBP interleaver in [13] favors both performance and throughput of turbo decoder.

Such method guarantees no hazards when multiple MAP decoders try to access multiple memories concurrently. The IBP interleaver consists of two steps of permutation: intra-block permutation and inter-block permutation. The first step rearranges the symbol sequences in each sub-block with the same rule. The second step swaps the sequences between blocks periodically. The destination can be derived by executing bit-wise exclusive-or between the original block index and the IBP parameter. Fig. 3.4 demonstrates an example of IBP interleaver with four sub-blocks. First, all sub-blocks are individually reordered by right rotate;

Second, they exchange data among these permuted sequences.

Fig. 3.4 An example of IBP interleaver with four sub-blocks 3.3.3 Butterfly network

The butterfly network is designed to perform the inter-block permutation in the IBP interleaver. This structure also avoids the memory contention problem between sub-blocks and reduces the circuit complexity. Fig. 4 shows the corresponding structure for above example illustrated in Fig. 3. The network is divided into two levels, and each level has one external signal to control the multiplexers. S0 and S1 will define four possible connections. In

general, the butterfly network links N memories to N MAP decoders by log2N levels of switches. Each level requires 1-bit control signal to manage its N multiplexers; the total log2N bits establish N possible connections.

Fig. 3.5 A 4x4 butterfly network for IBP interleaver 3.3.4 Double prime interleaver

All the data inside each block will be divided into two groups and be stored in the two separate memory banks. When radix-4 MAP decoders request two symbols at each cycle, these two symbols must be derived from different memory banks. This is another contention problem that should be aware of. Our design uses the double prime interleaver to resolve this problem. The double prime interleaver is constructed by two prime interleavers whose function are expressed by

(( 2 ) mod ) 2 1, is odd2 (( 2 ) mod ) 2, is even2

( ) ⁱ {

ⁱ_i ^p_{p s} ^L ^L ⁱ_i

π

^{⎢ ⎥×}^{⎣ ⎦} ^{× +}

⎢ ⎥× + ×

⎣ ⎦

=

(3. 1)

This L is the block length, and it must be an even number. Note that p must be relative prime to L/2 and s is a constant shift. Both the interleaver and de-interleaver could be expressed in (3.1) with different parameters. Double prime interleaver with well-searched parameters would outperform the interleaver in 3GPP turbo coding. Most important of all, an well-designed double prime interleaver is an fully contention-free interleaver for certain

sub-block length. For example, we can choose any factor of the sub-block length as the parallel access number and the memory bank number. It is guaranteed that a well-designed double prime interleaver is a contention-free interleaver.

3.4 High-Throughput MAP Decoders

3.4.1 Retimed radix-2x2 ACS unit

For trellis-based decoders, the branch number of conventional high-radix design increases exponentially however the branch number of the two-stage structure increases linearly. A two-stage ACS is introduced in [14] to ease the area overhead of high-radix ACS.

The complexity of ACS unit depends on the branch number, so our design prefers radix- 2 × 2 ACS to radix-4 ACS. But the critical path of two-stage structure is longer than conventional structure. The recursive property of path metric would make the pipelining method inefficient here, however, the critical path can be reduced by our proposed retiming method.

It is obvious that the ACS unit could not execute compare-select operations until addition results are ready; such data dependency restricts the operating frequency. To eliminate the dependency, the data path of ACS unit must be modified. So the proposed decoder applies the retiming technique, and Fig. 3.6 demonstrates the procedure of a retimed radix-2× 2 ACS. The first step shown in Fig. 3.6(a) is retiming of registers. Move and duplicate the registers ahead of the compare circuits, then computation order is rearranged from add-compare-select to compare-select-add. The registers have to store the summation of path metric and branch metric rather than only path metric. The second step shown in Fig. 3.6(b) is relocation of adders. Move and duplicate the adders ahead of the multiplexers; now the compare-select and addition could execute concurrently. The modified ACS unit is shown in Fig. 3.7, where the critical path becomes two consecutive compare-select operations. It would cause extra area overhead because of double registers and double adders, and the improvement of the radix-2×2 architecture could compensate for this loss. The relocated method can accomplish

not only high-speed but area-efficient solution.

Fig. 3.6 Retiming procedure of a radix 2x2 ACS unit

Fig. 3.7 A retimed radix 2x2 ACS unit

3.4.2 The circuit for log-likelihood ratio calculation

Our design adopts the modulo normalization to avoid over- flow of path metric [15].

This method requires only one more bit in the ACS unit and a simple modification inside the LLR unit; there are no specific circuits for normalization in ACS unit. Only the differences between forward path metrics and the differences between backward state metrics are significant in modulo normalization, so the LLR unit has to use these differences to calculate the log-likelihood value. Our design rearranges the computation order of log-likelihood value from circuit in Fig. 3.8(a) to circuit in Fig. 3.8(b). Although the two circuits have the same function, but original circuit may result in overflow due to the limited data width. The modified circuit could guarantee the correctness and cause no extra path delay.

Fig. 3.8 The circuit for log-likelihood calculation

3.5 Simulation Result and Chip Implementation

The proposed turbo code with code rate 1/2 could decode 4096 bits after 8 iterations, and the implementation applies maximum log-MAP algorithm with a scaling factor 0.75. The other specifications are listed in Table. 3.1. Fig. 3.10 and Fig. 3.11 shows the performance comparison between the proposed code and 3GPP turbo code. The floating point and the fixed point simulation result are both competitive to the result of 3GPP standard. However, the proposed turbo design has better distance property due to the interleaver design than the 3GPP

standard. Obviously, the 3GPP standard suffers from the error floor phenomenon more than the proposed design.

Fig. 3.9 FER performance compared with 3Gpp turbo code

Fig. 3.10 BER performance compared with 3Gpp turbo code

Table 3.1 Turbo Decoder Specification

Algorithm Max-Log MAP

ACS unit Radix 2x2 (retimed)

Code polynomial

Code Rate 1/2 (punctured)

Block length 4096(128 x 32)

在文檔中 Gbps高速渦輪碼之設計與實現 (頁 25-0)