

2.3 Practical turbo decoder architecture

Because of the permutation performed by the interleaver and de-interleaver, the last few extrinsic information values of one constituent code may serve as the first few a priori probability estimates for the other code. A half-iteration therefore cannot start until the required a priori information is available.

In conventional turbo decoders such as the one in Fig. 2.10, one SISO decoder is always idle while the other is executing. A practical decoder architecture takes advantage of this property by letting the two component codes share a single SISO decoder. Fig. 2.12 shows the corresponding architecture, which consists mainly of one SISO decoder and several memories.

Here rs is the received systematic sequence $(r_0^{(0)}, \ldots, r_{N-1}^{(0)})$; rp1 is the received first parity sequence $(r_0^{(1)}, \ldots, r_{N-1}^{(1)})$; and rp2 is the received second parity sequence $(r_0^{(2)}, \ldots, r_{N-1}^{(2)})$. All received codewords and temporary decoding results are stored in these memories. Both the interleaver and the de-interleaver are implemented as address generators for the memory module. In fact, the memories for Le1 and Le2 can be merged into a smaller one with elaborate control of memory access. The multiplexers determine which component code is processed in each half-iteration.

The SISO decoder usually uses the Log-MAP or Max-Log-MAP algorithm [22] rather than the MAP algorithm [16] to reduce complexity. In the Log-MAP algorithm, the branch metric calculation in (2.35) includes the channel reliability value Lc = 4Es/N0. Achieving high performance therefore requires precise SNR estimation for the AWGN channel [31], but the true channel information is difficult to obtain. In addition, the argument of the logarithmic term ln(1 + e^(−|x1−x2|)) in (2.36) spans a wide range, so a large lookup table is needed for this nonlinear function. When the Max-Log-MAP algorithm is applied, both overheads are avoided. After max* is simplified to max in (2.38), the logarithmic term disappears and the lookup table is unnecessary. The value Lc also becomes irrelevant because of the maximum function [32].

Although the Max-Log-MAP approximation degrades performance, the loss can be alleviated by scaling the extrinsic information [33, 34].
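The difference between the two algorithms can be illustrated with a minimal Python sketch. It is not taken from the cited designs: the function names and the scaling factor 0.75 are assumptions chosen only for illustration. The first function evaluates the exact two-input logarithmic sum with its correction term, the second drops the correction term, and the third applies a constant scaling to the extrinsic information.

```python
import math

def max_star(x1: float, x2: float) -> float:
    """Exact ln(e^x1 + e^x2): the maximum plus the correction ln(1 + e^-|x1 - x2|)."""
    return max(x1, x2) + math.log1p(math.exp(-abs(x1 - x2)))

def max_log(x1: float, x2: float) -> float:
    """Max-Log-MAP approximation: the correction term (and its lookup table) is dropped."""
    return max(x1, x2)

def scale_extrinsic(le: float, factor: float = 0.75) -> float:
    """Scale extrinsic information; 0.75 is an assumed example value, not from the source."""
    return factor * le

# The approximation error is the dropped correction term, which never exceeds ln 2.
error = max_star(1.0, 3.0) - max_log(1.0, 3.0)

# Multiplying every input by a positive constant c multiplies the maximum by c,
# which is why a common factor such as Lc does not change which candidate wins.
c = 2.5
same_winner = max_log(c * 1.0, c * 3.0) == c * max_log(1.0, 3.0)
```

The correction term is largest when the two inputs are equal, which is where the Max-Log-MAP approximation loses the most.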

The number of quantization bits, also called the data width, is another design issue in fixed-point implementation, since it affects both performance and area. Both α and β accumulate many branch metrics γ over time, and an unlimited number of bits cannot be provided to represent them. On the other hand, too small a data width may cause overflow. The data width must therefore be minimized without performance loss. The works in [35, 36] show that the quantities in the SISO decoder are bounded. Given the width of the received data, the difference between any two γ(St, St+1) at the same time instant is bounded; the maximal differences among α(St) and among β(St) can then be determined; and finally a sufficient width for L(ut) can be obtained [35].

Modulo normalization [37, 38], which exploits these bounded differences, can be applied to prevent overflow. As a result, a turbo decoder can be implemented with reasonable overhead and good performance.
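The idea behind modulo normalization can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the circuit of [37, 38]: the 8-bit width and the function names are hypothetical. Path metrics are allowed to wrap around in a fixed-width register, and two metrics are compared through the sign bit of their wrapped difference, which gives the right answer as long as the true difference is smaller than half the register range.

```python
WIDTH = 8                  # assumed metric width in bits (illustrative only)
MASK = (1 << WIDTH) - 1    # keeps the low WIDTH bits
HALF = 1 << (WIDTH - 1)    # position of the two's-complement sign bit

def wrap(value: int) -> int:
    """Store a metric in a WIDTH-bit register, discarding any overflow."""
    return value & MASK

def mod_greater_equal(a: int, b: int) -> bool:
    """Return True if the metric stored as a is >= the metric stored as b.

    Correct only when the true difference is below 2**(WIDTH - 1) in magnitude,
    which is what the bounded path metric differences guarantee.
    """
    diff = (a - b) & MASK
    return diff < HALF     # sign bit clear means a non-negative difference

# True metrics 1020 and 1030 straddle a wrap-around of the 8-bit register.
stored_small = wrap(1020)  # 252
stored_large = wrap(1030)  # 6
```

Although the stored value 6 is numerically smaller than 252, the modular comparison still identifies it as the larger metric, so no explicit subtraction of a common offset is needed.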

The major computations of a SISO decoder with the Max-Log-MAP algorithm are γ in (2.35), α in (2.39), β in (2.40), and the LLR in (2.41). Because the value of $y_t^{(j)}$ is ±1, the circuits for γ calculation are ordinary adders. The α, β, and LLR calculations must find the maximum among several sums, and this maximum function can be achieved by add-compare-select (ACS) units. Fig. 2.13(a) shows the basic ACS unit for the forward path metric calculation in Fig. 2.6(a). The comparator performs a subtraction so that the sign of the difference indicates the larger input, which the multiplexer then selects. With an appropriate substitution of inputs, the same functional unit can perform the backward path metric calculation.

The practical SISO decoder usually adopts the sliding window method to reduce overhead [24]. This method computes a dummy backward path metric βd to provide reliable initialization for β in each window; the typical process was introduced in Fig. 2.7. Let α-ACS denote the circuits for the forward path metrics of all states; the SISO decoder likewise contains β-ACS and βd-ACS. Fig. 2.13(b) presents another ACS unit, for the APP value. Its addition involves α, γ, and β. There are at least 2^m sums for the same value of ut, and the comparator and multiplexer choose the maximum among all candidates. Once the maximal APP with ut = 0 and the maximal APP with ut = 1 are available, the LLR and the extrinsic information can be computed. The complete circuitry for LLR calculation is called the LLR unit here.

Figure 2.13: Fundamental circuits for path metric and LLR calculations. (a) ACS unit for path metrics. (b) ACS unit for log-likelihood value.

Since the design computes βd of Wj+1, α of Wj, and β of Wj−1 at the same time, the received data of three windows must be kept until all relevant operations on them are finished. The input cache within the SISO decoder facilitates access to these data. Moreover, an internal buffer is needed to store α temporarily. Therefore, the complete SISO decoder comprises input buffers, branch metric units, βd-ACS, α-ACS, β-ACS, an α buffer, and an LLR unit.
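The add-compare-select operation for path metrics can be modelled in software as follows. This is a behavioural sketch only: the two-state, fully connected toy trellis, the metric values, and the function names are assumptions made for illustration and do not correspond to the trellis of Fig. 2.6(a).

```python
def acs(alpha_a: int, gamma_a: int, alpha_b: int, gamma_b: int) -> int:
    """Add-compare-select for two incoming branches of one state."""
    sum_a = alpha_a + gamma_a        # add
    sum_b = alpha_b + gamma_b
    diff = sum_a - sum_b             # compare: the comparator is a subtraction
    return sum_a if diff >= 0 else sum_b   # select on the sign of the difference

def forward_step(alpha, gamma, predecessors):
    """One trellis stage of the forward recursion (the role of the alpha-ACS).

    alpha:        path metrics of all states at time t
    gamma:        dict mapping (previous_state, next_state) to a branch metric
    predecessors: for each next state, its two previous states
    """
    return [
        acs(alpha[p0], gamma[(p0, s)], alpha[p1], gamma[(p1, s)])
        for s, (p0, p1) in enumerate(predecessors)
    ]

# Hypothetical toy trellis: two states, each reachable from both states.
predecessors = [(0, 1), (0, 1)]
alpha_t = [0, -2]
gamma_t = {(0, 0): 1, (1, 0): 4, (0, 1): -1, (1, 1): 0}
alpha_next = forward_step(alpha_t, gamma_t, predecessors)
```

Each output of `forward_step` depends on the outputs of the previous stage, which is the data dependency that makes the recursion hard to pipeline. Swapping the roles of previous and next states in `predecessors` gives the backward recursion with the same `acs` function.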

Among all component circuits, the ACS units for recursive path metric calculation form the speed bottleneck of the trellis-based decoding process. The data dependency between successive trellis stages makes it difficult to insert pipeline registers within these ACS units. Although the LLR calculation also needs an ACS unit, its data path can be shortened by pipelining. Many studies improve the critical path delay by modifying the ACS circuit: the design in [39] shifts the normalization circuit, and the design in [40] applies the double state technique. With these modifications, the turbo decoder can operate at a higher frequency.

Besides frequency, the total iteration number is essential for throughput. Fig. 2.11 indicates that the performance improvement is small once the iteration number exceeds a certain value. For some received blocks, the iterative process converges quickly, so an efficient stopping criterion can reduce the average iteration number [21, 41, 42]. Most early stopping rules examine the difference between the temporary results of two consecutive iterations or half-iterations. If the difference is less than a given threshold, the decoding of the current data block terminates. The choice of temporary result affects the iteration number, the circuit complexity, and the performance loss. These methods work well at high SNR, where the decoder can achieve similar performance with half as many iterations or fewer.
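One simple form of such a rule compares the hard decisions of two consecutive iterations. The sketch below is only an example of the general idea and is not the criterion of [21, 41, 42]; the function names, the default threshold, and the convention that a non-negative LLR maps to bit 1 are all assumptions.

```python
from typing import List, Sequence

def hard_decisions(llr: Sequence[float]) -> List[int]:
    """Map each LLR to a bit; the sign convention (LLR >= 0 means 1) is assumed."""
    return [1 if value >= 0 else 0 for value in llr]

def should_stop(prev_llr: Sequence[float], curr_llr: Sequence[float],
                threshold: int = 1) -> bool:
    """Stop when fewer than `threshold` decisions changed since the last iteration."""
    prev_bits = hard_decisions(prev_llr)
    curr_bits = hard_decisions(curr_llr)
    changed = sum(p != c for p, c in zip(prev_bits, curr_bits))
    return changed < threshold

# All signs agree between the two iterations, so decoding may terminate.
converged = should_stop([1.2, -0.3, 2.0], [3.1, -1.5, 4.0])

# The second decision flipped, so another iteration is required.
still_changing = should_stop([1.2, 0.3, 2.0], [3.1, -1.5, 4.0])
```

Using hard decisions keeps the comparison cheap, while comparing soft values instead would change the trade-off among iteration number, circuit complexity, and performance loss mentioned above.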

Fig. 2.14(a) shows the traditional SISO decoder architecture with three separate input buffers [43], and Fig. 2.14(b) illustrates its processing schedule.

Figure 2.14: Conventional SISO decoder and its schedule with three windows

The SISO decoder acquires the received data in ascending order. Once the data of an entire window are ready, they are sent from the input buffers to the βd-ACS, α-ACS, and β-ACS, so these functional units are inactive during the first TW cycles. Each half-iteration can be divided as follows: δa and δb account for the pipeline delay time and memory access time; τa is the interval for initial metric calculation between receiving the first input and producing the first output; and τb is the time to output all LLRs and decisions. Fig. 2.14(c) shows when the major operations of every window are executed. After all extrinsic information has been written back to memory, the following half-iteration starts. The component functional units stay idle for (δa + δb + τa) across two successive half-iterations. Only τb of the total execution time is spent generating decoding results, and this ratio is regarded as the operating efficiency, denoted η in (2.52), in the throughput calculation.

$$\eta = \frac{\tau_b}{\delta_a + \delta_b + \tau_a + \tau_b} \tag{2.52}$$

The values of δa, δb, τa, and τb are affected by the window length and the decoder architecture.

From Fig. 2.14(b), the necessary cycles of these execution periods can be expressed as

$$\begin{cases} 0 \le \delta_a, \delta_b \le T_W \\ 3T_W \le \tau_a \le 4T_W \\ \tau_b = 3T_W, \end{cases} \tag{2.53}$$

where TW is the number of cycles that each functional unit takes to process one window.

In general, δa, δb, and τa are constant for any block size, whereas τb is proportional to the number of windows. When the SISO decoder has to process κ (= N/L) windows, only τb changes, becoming (κ × TW) cycles. Assuming that the sum of δa, δb, and τa is 4TW, η equals κ/(κ + 4) with the traditional schedule. If the block size is large, the loss in η is slight; otherwise, the decoding process is inefficient. For example, the η of the conventional design is merely 42.9% with κ = 3.

The terms δa, δb, and τa dominate the operating efficiency η, so a straightforward way to raise η is to shorten any of them. The work in [44] modifies how the data are input to the SISO decoder: for each window, the received symbols are sent in descending order. The corresponding architecture and schedule are given in Fig. 2.15.

Figure 2.15: Modified SISO decoder with βd and its schedule with three windows

There are two input buffers, connected to the α-ACS and β-ACS. Rather than waiting for the input buffers, the βd-ACS can get its required data sequence immediately and start the backward recursive operations. Compared with the conventional SISO decoder, this design costs fewer storage elements and less processing time. The architecture avoids the initial redundant time, and τa becomes

$$2T_W \le \tau_a \le 3T_W. \tag{2.54}$$

The efficiency η then becomes κ/(κ + 3). For this schedule with κ = 3, η improves to 50%.

In [45] and [46], an initialization approach is proposed to remove a further TW cycles from τa. Instead of the dummy calculation, the boundary α and β from the previous iteration are used to initialize α and β in the current iteration. Fig. 2.16 depicts the modified SISO decoder without βd operations. The branch metric units and ACS units for βd are removed, and only one input buffer module is needed to support the β-ACS. However, this SISO decoder requires extra buffers for the previous boundary path metrics. This additional overhead is proportional to the window number κ, so the architecture is suitable for processing small blocks. Its major advantage is the shortened range of τa:

$$T_W \le \tau_a \le 2T_W. \tag{2.55}$$

The general expression of η is now κ/(κ + 2); when κ is 3, η is 60%. Among the three schedules, this method achieves the best operating efficiency.
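The three efficiency figures quoted above follow from the same expression, η = κ/(κ + c), where c is the assumed total of δa, δb, and τa measured in window periods (4, 3, and 2 for the three schedules). The short Python sketch below reproduces them; the function name and the dictionary labels are illustrative choices, not terms from the cited works.

```python
def operating_efficiency(kappa: int, overhead_windows: int) -> float:
    """eta = kappa / (kappa + c), with the overhead c counted in window periods."""
    return kappa / (kappa + overhead_windows)

# Overhead c for each schedule, following the assumptions stated in the text.
SCHEDULES = {
    "conventional (Fig. 2.14)": 4,
    "descending-order input (Fig. 2.15)": 3,
    "boundary-metric initialization (Fig. 2.16)": 2,
}

for name, overhead in SCHEDULES.items():
    eta = operating_efficiency(3, overhead)
    print(f"{name}: {eta:.1%}")   # 42.9%, 50.0%, and 60.0% for kappa = 3

# With many windows the overhead matters far less, e.g. kappa = 100.
eta_large_block = operating_efficiency(100, 4)
```

For κ = 100 the conventional schedule already exceeds 96%, which is why the choice of schedule matters mainly for small blocks.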

The turbo decoder should adopt the architecture and schedule that bring the greatest benefit to the specific application. If the decoder deals only with large blocks, the SISO decoder in Fig. 2.15 is preferred because of its reasonable area and tolerable η. Conversely, for small blocks the SISO decoder in Fig. 2.16 has superior η and may also incur less overhead. When the application involves both small κ and large κ, a trade-off must be made between cost and efficiency. The cycle number of one half-iteration is

$$\delta_I = \frac{\kappa \times T_W}{\eta} = \frac{N}{\eta}. \tag{2.56}$$

Figure 2.16: Modified SISO decoder without βd and its schedule with three windows

According to [47], the critical path delay, the iteration number, and the operating efficiency are all essential for decoding speed. The throughput of a normal turbo decoder can be calculated via

$$\frac{N}{2 \times I \times (N/\eta) \times (1/F)} = \frac{F \times \eta}{2 \times I}, \tag{2.57}$$

where F is the clock frequency and I is the iteration number. Based on the basic architectures introduced above, many designs have been developed to pursue higher throughput and shorter decoding time [9]. Most previous works concern the processing of large blocks [48, 49]. The corresponding η usually approaches 100%, and I is relatively larger than that for small blocks. In these cases, F is the principal factor in throughput. However, F cannot be increased without limit, so the maximal throughput of traditional turbo decoders has an upper bound. With growing interest in higher throughput, several techniques have been developed at the levels of architecture, algorithm, and decoding flow. These methodologies bring better speedup to turbo decoders, but they also pose new challenges.
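The throughput expression in (2.57) can be evaluated with a short Python sketch. The function names and the numbers used here (a 100 MHz clock, η = 0.6, and 8 iterations) are hypothetical values chosen for illustration and are not results from the cited designs.

```python
def half_iteration_cycles(block_size: int, eta: float) -> float:
    """Cycles of one half-iteration, N / eta, as in (2.56)."""
    return block_size / eta

def throughput_long_form(block_size: int, clock_hz: float,
                         eta: float, iterations: int) -> float:
    """Left-hand side of (2.57): bits decoded divided by total decoding time."""
    seconds = 2 * iterations * half_iteration_cycles(block_size, eta) / clock_hz
    return block_size / seconds

def throughput(clock_hz: float, eta: float, iterations: int) -> float:
    """Right-hand side of (2.57): F * eta / (2 * I), in bits per second."""
    return clock_hz * eta / (2 * iterations)

# Hypothetical operating point, for illustration only.
rate = throughput(clock_hz=100e6, eta=0.6, iterations=8)
```

The block size cancels between the numerator and the denominator, so for a fixed η the throughput depends only on F and I, which is why F dominates once η is close to 100%.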

Chapter 3