2.3 Practical turbo decoder architecture

Due to the permutation performed by the interleaver and de-interleaver, the last few extrinsic values produced for one constituent code may be the first few a priori probability estimates needed by the other code. A half-iteration therefore cannot start until the required a priori information is available.

For conventional turbo decoders such as the one in Fig. 2.10, one SISO decoder is always idle while the other is executing. A practical decoder architecture takes advantage of this property and lets the two component codes share the same SISO decoder. Fig. 2.12 shows the corresponding architecture, which consists mainly of one SISO decoder and memories.

Here r_s is the received systematic sequence $(r_0^{(0)}, \ldots, r_{N-1}^{(0)})$; r_p1 is the received first parity sequence $(r_0^{(1)}, \ldots, r_{N-1}^{(1)})$; and r_p2 is the received second parity sequence $(r_0^{(2)}, \ldots, r_{N-1}^{(2)})$. All received codewords and temporary decoding results are stored in these memories. Both the interleaver and the de-interleaver are implemented as address generators for the memory modules. In fact, the memories for L_e1 and L_e2 can be merged into a smaller one with careful control of memory access. The multiplexers determine which component code is processed in each half-iteration.

The SISO decoder usually uses the Log-MAP or Max-Log-MAP algorithm [22] rather than the MAP algorithm [16] to reduce complexity. In the Log-MAP algorithm, the branch metric calculation in (2.35) includes the channel reliability value L_c = 4E_s/N_0. To guarantee high performance, a precise SNR estimate for the AWGN channel is required [31], but accurate channel information is difficult to obtain. Besides, the argument of the logarithmic term $\ln(1 + e^{-|x_1 - x_2|})$ in (2.36) spans a very wide range, and a lookup table is needed for this nonlinear function. When the Max-Log-MAP algorithm is applied, this overhead is avoided. After max* is simplified to max in (2.38), the logarithmic term is removed and the lookup table is no longer needed. The value L_c also becomes irrelevant because of the maximum function [32]. Although this simplification leads to performance degradation, the loss can be alleviated by scaling the extrinsic information [33, 34].
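The difference between the two algorithms can be illustrated with a minimal sketch. The function names and the scaling factor 0.75 are assumptions chosen for illustration; they are not taken from (2.36), (2.38), or the cited works. The correction term that Max-Log-MAP drops is always positive and never exceeds ln 2.

```python
import math


def max_star(x1: float, x2: float) -> float:
    """Jacobian logarithm ln(e^x1 + e^x2), written as max plus a correction term."""
    return max(x1, x2) + math.log1p(math.exp(-abs(x1 - x2)))


def max_log(x1: float, x2: float) -> float:
    """Max-Log-MAP approximation: the logarithmic correction term is dropped."""
    return max(x1, x2)


def scale_extrinsic(extrinsic: float, factor: float = 0.75) -> float:
    """Scale the extrinsic information; 0.75 is an assumed, illustrative factor."""
    return factor * extrinsic


if __name__ == "__main__":
    for x1, x2 in [(0.0, 0.0), (1.0, 2.0), (5.0, -3.0)]:
        exact = max_star(x1, x2)
        approx = max_log(x1, x2)
        print(f"x1={x1:5.1f} x2={x2:5.1f} max*={exact:.4f} max={approx:.4f} "
              f"error={exact - approx:.4f}")
```

The approximation error is largest when the two inputs are equal and vanishes as they move apart, which is why the simplification removes the lookup table at the cost of a modest loss.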

The number of quantization bits, also called the data width, is another design issue in fixed-point implementation. It affects both the performance and the design area. Both α and β accumulate many branch metrics γ over time, and their values cannot be represented with an unlimited number of bits. On the other hand, too small a data width may cause overflow. The data width must therefore be minimized without performance loss. The works in [35, 36] prove that the quantities in the SISO decoder can be bounded. Given the width of the received data, the difference between any two γ(S_t, S_{t+1}) at the same time instant is bounded; the maximal differences of α(S_t) and of β(S_t) can then be determined; and finally the sufficient width for L(u_t) can be obtained [35].

The modulo normalization [37, 38], which makes use of these bounded differences, can be applied to prevent overflow. As a result, a turbo decoder can be implemented with reasonable overhead and good performance.
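The idea behind modulo normalization can be sketched as follows, assuming path metrics are held in two's complement registers that are simply allowed to wrap around. The helper names and the 8-bit width are hypothetical; the only property relied on is that the true difference between two metrics is smaller than half the register range.

```python
def wrap(value: int, width: int) -> int:
    """Keep the low `width` bits and reinterpret them as a two's complement number."""
    mask = (1 << width) - 1
    value &= mask
    if value >= 1 << (width - 1):
        value -= 1 << width
    return value


def modulo_greater_equal(a: int, b: int, width: int) -> bool:
    """Compare two wrapped metrics.

    The result is correct when the true difference satisfies
    |a_true - b_true| < 2 ** (width - 1).
    """
    return wrap(a - b, width) >= 0


if __name__ == "__main__":
    width = 8                      # assumed register width, for illustration only
    true_a, true_b = 1000, 990     # both exceed the 8-bit signed range
    stored_a, stored_b = wrap(true_a, width), wrap(true_b, width)
    print(stored_a, stored_b, modulo_greater_equal(stored_a, stored_b, width))
```

Because only the wrapped difference is inspected, no explicit subtraction of a common offset from all metrics is needed in this sketch.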

The major computations of a SISO decoder with the Max-Log-MAP algorithm are γ in (2.35), α in (2.39), β in (2.40), and the LLR in (2.41). Because the value of $y_t^{(j)}$ is ±1, the circuits for the γ calculation are ordinary adders. The α, β, and LLR calculations have to find the maximum among several sums. This maximum function can be achieved by add-compare-select (ACS) units. Fig. 2.13(a) shows the basic ACS unit for the forward path metric calculation in Fig. 2.6(a). The comparator performs a subtraction, so the sign of the difference indicates the larger input, which the multiplexer then selects. With an appropriate substitution of inputs, this functional unit can also perform the backward path metric calculation.

The practical SISO decoder usually adopts the sliding window method to reduce overhead [24]. This method computes the dummy backward path metric β_d to provide reliable initialization for β in each window; the typical process was introduced in Fig. 2.7. Let α-ACS denote the circuits for the forward path metrics of all states; likewise, there are β-ACS and β_d-ACS in the SISO decoder. Fig. 2.13(b) presents another ACS unit, used for the APP value. Its addition involves α, γ, and β. There are at least 2^m sums associated with the same u_t, and the comparator and multiplexer choose the maximal one among all candidates. Once the maximal APP with u_t = 0 and the maximal APP with u_t = 1 are available, the LLR and the extrinsic information can be computed. The total circuitry for the LLR calculation is called the LLR unit here. Since the design computes β_d of W_{j+1}, α of W_j, and β of W_{j−1} at the same time, the received data of three windows must be kept until all their relevant executions are finished. The input cache within the SISO decoder facilitates access to these data. Moreover, an internal buffer is needed to store α temporarily. Therefore, the complete SISO decoder comprises input buffers, branch metric units, β_d-ACS, α-ACS, β-ACS, an α buffer, and an LLR unit.

Figure 2.13: Fundamental circuits for path metric and LLR calculations. (a) ACS unit for path metrics; (b) ACS unit for log-likelihood value.
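The behavior of the two ACS units can be mimicked in software with a short sketch. The function names are hypothetical, and the sign convention of the LLR (maximum for u_t = 1 minus maximum for u_t = 0) is an assumption, since (2.41) is not reproduced here.

```python
from typing import Iterable, Tuple

Branch = Tuple[float, float, float]  # (alpha, gamma, beta) of one trellis branch


def acs_path_metric(metric_a: float, gamma_a: float,
                    metric_b: float, gamma_b: float) -> float:
    """Add-compare-select for one state with two incoming branches."""
    sum_a = metric_a + gamma_a        # add
    sum_b = metric_b + gamma_b
    difference = sum_a - sum_b        # compare: the comparator is a subtraction
    return sum_a if difference >= 0 else sum_b   # select: multiplexer on the sign


def max_log_llr(branches_u0: Iterable[Branch],
                branches_u1: Iterable[Branch]) -> float:
    """Max-Log-MAP LLR from the best branch for each value of the input bit.

    Assumed convention: LLR = max over u=1 branches minus max over u=0 branches.
    """
    best_u0 = max(a + g + b for a, g, b in branches_u0)
    best_u1 = max(a + g + b for a, g, b in branches_u1)
    return best_u1 - best_u0


if __name__ == "__main__":
    print(acs_path_metric(1.0, 2.0, 3.0, -1.0))
    print(max_log_llr([(0.0, 1.0, 0.0), (1.0, 1.0, 1.0)], [(2.0, 2.0, 2.0)]))
```

The same `acs_path_metric` function serves the forward and backward recursions; only the metrics fed to it change, which mirrors the substitution of inputs described above.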

Among all component circuits, the ACS units for recursive path metric calculation are the speed bottleneck of the trellis-based decoding process. The data dependency between successive trellis stages makes it difficult to insert pipeline registers within these ACS units. Although the LLR calculation also needs an ACS unit, its data path can be shortened by pipelining. Many works improve the critical path delay by modifying the ACS circuit: the design in [39] shifts the normalization circuit, and the design in [40] applies the double state technique. With these modifications, the turbo decoder can operate at a higher frequency.

Besides frequency, the total iteration number is essential for throughput. Fig. 2.11 indicates that the performance improvement is small once the iteration number exceeds a certain value. Some erroneous blocks converge faster than others during the iterative process, so an efficient stopping criterion can be exploited to reduce the average iteration number [21, 41, 42]. Most early stopping rules examine the difference between the temporary results of two consecutive iterations or half-iterations. If the difference is less than a given threshold, the decoding of the current data block terminates. The choice of temporary result affects the iteration number, the circuit complexity, and the performance loss. These methods work well at high SNR, where the decoder can achieve similar performance with half as many iterations or fewer.
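One possible form of such a stopping rule is sketched below. It takes the hard decisions (signs of the LLRs) as the temporary result and counts how many of them change between two consecutive iterations. This choice, the function name, and the default threshold are assumptions for illustration; the rules in [21, 41, 42] may use other quantities.

```python
from typing import Sequence


def should_stop(previous_llr: Sequence[float],
                current_llr: Sequence[float],
                threshold: int = 1) -> bool:
    """Return True when fewer than `threshold` hard decisions changed.

    A hard decision is taken here as the sign of the LLR (assumed convention).
    With the default threshold of 1, decoding stops only when no decision changed.
    """
    changes = sum((p >= 0) != (c >= 0)
                  for p, c in zip(previous_llr, current_llr))
    return changes < threshold


if __name__ == "__main__":
    print(should_stop([1.0, -2.0, 3.0], [2.5, -4.0, 6.0]))    # decisions unchanged
    print(should_stop([1.0, -2.0, 3.0], [-0.5, -4.0, 6.0]))   # one decision flipped
```

A rule of this kind only needs one bit per information symbol from the previous iteration, which keeps the extra storage small.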

Figure 2.14: Conventional SISO decoder and its schedule with three windows

Fig. 2.14(a) shows the traditional SISO decoder architecture with three separate input buffers [43], and Fig. 2.14(b) illustrates its processing schedule. The SISO decoder acquires the received data in ascending order. Once the data of an entire window are ready, they are sent from the input buffers to the β_d-ACS, α-ACS, and β-ACS, so these functional units are inactive during the first T_W cycles. Each half-iteration can be divided as follows: δ_a and δ_b are both pipeline delay and memory access time; τ_a is the interval for initial metric calculation between receiving the first input and producing the first output; and τ_b is the time to output all LLRs and decisions. Fig. 2.14(c) shows when the major operations of every window are executed. After all extrinsic information has been written back to memory, the following half-iteration starts. The component functional units stay idle for (δ_a + δ_b + τ_a) across two successive half-iterations. Only τ_b of the total execution time is spent generating decoding results; this ratio is regarded as the operating efficiency, denoted η, in (2.52) and is used in the throughput calculation.

$$\eta = \frac{\tau_b}{\delta_a + \delta_b + \tau_a + \tau_b} \tag{2.52}$$

The values of δ_a, δ_b, τ_a, and τ_b are affected by the window length and the decoder architecture. From Fig. 2.14(b), the necessary cycles of these execution periods can be expressed as

$$\begin{cases} 0 \le \delta_a, \delta_b \le T_W \\ 3T_W \le \tau_a \le 4T_W \\ \tau_b = 3T_W, \end{cases} \tag{2.53}$$

where T_W is the number of cycles that each functional unit takes to process one window.

In general, δ_a, δ_b, and τ_a are constant for any block size, whereas τ_b is proportional to the number of windows. When the SISO decoder has to process κ (= N/L) windows, τ_b becomes κ × T_W cycles while the other terms are unchanged. Assuming that the sum of δ_a, δ_b, and τ_a is 4T_W, η equals κ/(κ + 4) with the traditional schedule. If the block size is large, this overhead has only a slight influence on η; otherwise, the decoding process is inefficient. For example, the η of the conventional design is merely 3/7 ≈ 42.9% with κ = 3.

δ_a, δ_b, and τ_a dominate the operating efficiency η. A straightforward way to raise η is to shorten any of them. The work in [44] modifies how the data are fed to the SISO decoder: for each window, the received symbols are sent in descending order.

The corresponding architecture and schedule are given in Fig. 2.15. There are two input buffers, connected to the α-ACS and β-ACS. Rather than waiting for the input buffers, the β_d-ACS gets its required data sequence immediately and starts the backward recursive operations. Compared to the conventional SISO decoder, this design needs fewer storage elements and less processing time. The architecture avoids the initial redundant time, and τ_a becomes

$$2T_W \le \tau_a \le 3T_W. \tag{2.54}$$

Figure 2.15: Modified SISO decoder with β_d and its schedule with three windows

η then becomes κ/(κ + 3). For this schedule with κ = 3, η improves to 50%.

In [45] and [46], an initialization approach is proposed to remove T_W cycles from τ_a. Instead of the dummy calculation, the boundary α and β from the previous iteration are used to initialize α and β in the current iteration. Fig. 2.16 depicts the modified SISO decoder without β_d operations. The branch metric units and ACS units for β_d are removed, and only one input buffer module is used to support the β-ACS. However, this SISO decoder needs extra buffers for the boundary path metrics of the previous iteration. This additional overhead is proportional to the number of windows κ, so the architecture is suitable for processing small blocks. Its major advantage is the shortened range of τ_a:

$$T_W \le \tau_a \le 2T_W. \tag{2.55}$$

The general expression of η is now κ/(κ + 2). When κ is 3, η is 60%. Among the three schedules, this method achieves the best operating efficiency.
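The three efficiency figures can be checked with a short calculation. This is a minimal sketch: the dictionary labels and the function name are illustrative, and the overhead of 4, 3, and 2 window periods corresponds to the assumed sums of δ_a, δ_b, and τ_a that give κ/(κ + 4), κ/(κ + 3), and κ/(κ + 2) in the text.

```python
from fractions import Fraction

# Assumed overhead (delta_a + delta_b + tau_a) in units of T_W for each schedule.
OVERHEAD_WINDOWS = {
    "conventional (Fig. 2.14)": 4,
    "with beta_d (Fig. 2.15)": 3,
    "without beta_d (Fig. 2.16)": 2,
}


def operating_efficiency(kappa: int, overhead_windows: int) -> Fraction:
    """eta = tau_b / (overhead + tau_b), with tau_b = kappa * T_W."""
    return Fraction(kappa, kappa + overhead_windows)


if __name__ == "__main__":
    for kappa in (3, 30):
        for name, overhead in OVERHEAD_WINDOWS.items():
            eta = operating_efficiency(kappa, overhead)
            print(f"kappa={kappa:2d}  {name:28s} eta={float(eta):.1%}")
```

Running the loop for a larger κ shows how the gap between the three schedules narrows as the block size grows.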

A turbo decoder should adopt the architecture and schedule that bring the greatest benefit to the specific application. If the decoder only deals with large blocks, the SISO decoder in Fig. 2.15 is preferred because of its reasonable area and tolerable η. Conversely, for small blocks the SISO decoder in Fig. 2.16 has superior η, and it may also incur less overhead. When the application involves both small κ and large κ, a trade-off between cost and efficiency has to be made. The number of cycles of one half-iteration is

$$\delta_I = \frac{\kappa \times T_W}{\eta} = \frac{N}{\eta}. \tag{2.56}$$

Figure 2.16: Modified SISO decoder without β_d and its schedule with three windows

According to [47], the critical path delay, the iteration number, and the operating efficiency are all essential for decoding speed. The throughput of a normal turbo decoder can be calculated via

$$\text{Throughput} = \frac{N}{2 \times I \times (N/\eta) \times (1/F)} = \frac{F \times \eta}{2 \times I}, \tag{2.57}$$

where F is the clock frequency and I is the iteration number. Based on the basic architectures introduced above, many designs have been developed to pursue higher throughput and shorter decoding time [9]. Most previous works concern the processing of large blocks [48, 49]. The corresponding η usually approaches 100%, and I is relatively larger than that for small blocks. In these cases, F is the principal factor in throughput. However, F cannot be increased indefinitely, so the maximal throughput of traditional turbo decoders has an upper bound. With growing interest in higher throughput, several techniques have been developed at the architecture, algorithm, and decoding flow levels. These methodologies bring better speedup to turbo decoders, but they also pose new challenges.
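The relation between clock frequency, operating efficiency, and iteration number in (2.57) can be made concrete with a small numeric sketch. The function names and the example values (100 MHz, η = 0.5, 5 iterations, N = 1024) are assumptions chosen only to show the arithmetic; they do not describe any particular design.

```python
def half_iteration_cycles(block_size: int, eta: float) -> float:
    """Cycles of one half-iteration, N / eta, as in (2.56)."""
    return block_size / eta


def throughput_bits_per_second(clock_hz: float, eta: float, iterations: int) -> float:
    """Throughput F * eta / (2 * I), as in (2.57)."""
    return clock_hz * eta / (2 * iterations)


if __name__ == "__main__":
    clock_hz, eta, iterations, block_size = 100e6, 0.5, 5, 1024  # assumed values

    # Decoding time of one block: 2 * I half-iterations of N / eta cycles each.
    decode_seconds = 2 * iterations * half_iteration_cycles(block_size, eta) / clock_hz
    print(f"decoding time per block: {decode_seconds * 1e6:.1f} us")
    print(f"throughput from time:    {block_size / decode_seconds / 1e6:.2f} Mb/s")
    print(f"throughput from (2.57):  "
          f"{throughput_bits_per_second(clock_hz, eta, iterations) / 1e6:.2f} Mb/s")
```

The block size cancels out of the final expression, so for a fixed η the throughput depends only on F and I, which is consistent with F being the principal factor when η is close to 100%.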

Chapter 3
