CHAPTER 5 ARCHITECTURE OF PROPOSED DUAL MODE TURBO/VITERBI
5.2 ARCHITECTURE OF TURBO DECODER
The architecture of the integrated turbo/Viterbi decoder operating in turbo mode is shown in Fig. 5.2, in which all disabled blocks and unconnected lines are drawn as dotted lines.
Although iterative decoding with ten iterations provides a 0.2 dB coding gain over decoding with six iterations, the ten-iteration scheme is not adopted in our design because of its longer output latency and higher power dissipation. The detailed operating flow is described as follows.
Fig. 5.2 The architecture of integrated turbo/Viterbi decoder in turbo mode
5.2.1 Single MAP Decoder design
In general, the block diagram of a turbo decoder can be drawn as in Fig. 2.4, which consists of two MAP decoders, two interleavers, and one de-interleaver. Implementing the turbo decoder directly from this diagram is complicated and inefficient. Since the two constituent decoders are identical, a single MAP decoder is proposed, which both reduces the design cost and simplifies the control logic for the two SISO decoders.
To achieve a single MAP decoder architecture, a full decoding iteration is split into two phases. In the first phase, for SISO decoder 1, the MAP decoder reads the systematic data, the parity data, and the extrinsic values that come from the other decoder after de-interleaving. The output extrinsic data are stored in memory. In the second phase, for SISO decoder 2, the MAP decoder processes the permuted systematic data, the parity data from the second encoder, and the a priori values, which are the interleaved extrinsic output of SISO decoder 1. A simplified architecture of Fig. 2.4 is illustrated in Fig. 5.3. Note that there is an additional input cache and only one memory block for extrinsic data storage. These are introduced in sub-sections 5.2.2 and 5.2.6, respectively.
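The two-phase schedule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `decode`, `siso`, and `perm` are hypothetical names, `siso` stands in for the MAP decoder as any function that maps (systematic, parity, a priori) lists to an extrinsic list, and the default of six iterations follows the choice discussed above.

```python
def decode(systematic, parity1, parity2, perm, siso, iterations=6):
    """Run both half-iterations on one shared SISO function (a stand-in
    for the single MAP decoder) and one extrinsic memory."""
    n = len(systematic)
    extrinsic = [0.0] * n  # single extrinsic memory, kept in natural order
    for _ in range(iterations):
        # Phase 1 (SISO decoder 1): natural-order inputs.
        extrinsic = siso(systematic, parity1, extrinsic)
        # Phase 2 (SISO decoder 2): permuted systematic and a priori data.
        sys_p = [systematic[perm[i]] for i in range(n)]
        apr_p = [extrinsic[perm[i]] for i in range(n)]
        ext_p = siso(sys_p, parity2, apr_p)
        # Writing back through the same permutation de-interleaves the result.
        for i in range(n):
            extrinsic[perm[i]] = ext_p[i]
    return extrinsic


def toy_siso(sys_vals, par_vals, apr_vals):
    """Placeholder arithmetic, not a MAP decoder: it only makes the
    data flow observable."""
    return [s + p + a for s, p, a in zip(sys_vals, par_vals, apr_vals)]
```

Calling `decode([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0], [2, 0, 1], toy_siso, iterations=1)` shows that the same function serves both phases and that the extrinsic values end up back in natural order.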
Fig. 5.3 A single MAP decoder architecture for turbo decoding
5.2.2 Cache design
As mentioned in the previous section, a sliding-window approach is adopted in our design. Referring to Fig. 2.7, the data of each sub-block must be read three times, once each by the ACS-β1, ACS-α, and ACS-β2 units. An input cache is therefore implemented to reduce repeated accesses to external memory, which also makes power-down possible. The cache keeps three consecutive sub-blocks and provides one write port for data updating and three read ports for the ACS units. As shown in Fig. 5.4, the cache is implemented with a dual-port SRAM of 60×24 bits and uses a time-multiplexed approach to provide four data ports. A set of additional registers is employed at output port-2 to guarantee that all cache outputs are synchronized to the same rising clock edge. The detailed timing chart is shown in Fig. 5.5.
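The time-multiplexing idea can be illustrated with a small behavioral model. It is a sketch only: the class and method names are hypothetical, and the assignment of accesses to the two fast clock phases is an assumption, not the schedule of Fig. 5.5. The output registers that realign the read data are not modeled.

```python
class TimeMultiplexedCache:
    """Dual-port memory run at twice the decoder clock, so that each
    decoder cycle offers 2 ports x 2 fast phases = 4 accesses."""

    DEPTH = 60  # words, matching the 60x24 SRAM described in the text

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.fast_cycles = 0

    def decoder_cycle(self, write_addr, write_data, read_addrs):
        """Perform one write and three reads within one decoder cycle."""
        addr1, addr2, addr3 = read_addrs
        # Fast phase 0 (assumed): one port reads, the other port writes.
        out1 = self.mem[addr1]
        self.mem[write_addr] = write_data
        # Fast phase 1 (assumed): both ports read.
        out2 = self.mem[addr2]
        out3 = self.mem[addr3]
        self.fast_cycles += 2
        return out1, out2, out3
```

In this model, two decoder cycles consume four fast clock cycles, which is the 2x clock ratio that lets a two-port memory behave like a four-port one.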
[Figure: dual-port 60×24 memory running at 2x clock rate, with input port-0, output ports 1 to 3, and addresses 0 to 3]
Fig. 5.4 The input cache architecture
[Figure: timing of the Port A and Port B accesses that read the data for computing β1, α, and β2]
Fig. 5.5 The detailed timing chart of the proposed input cache
5.2.3 Transition Metric Unit (TMU)
In the 3GPP2 standard, eight branch metrics are required for LLR computation. According to (2.14), each branch metric is obtained by adding or subtracting the received symbols and the a priori data, depending on the branch codeword. Implementing this directly consumes many adders and subtractors. A simple way to overcome this problem is to use the equivalent formula given in (5.1).
[Equation (5.1)]
... the difference of branch metrics. The modified architecture of the TMU is shown in Fig. 5.6.
Fig. 5.6 The TMU architecture for turbo decoder
5.2.4 ACS Unit
The trellis diagrams of both the turbo code and the convolutional code can be decomposed into basic butterfly units, each of which can be implemented with ACS units. Because the branch metrics are accumulated, normalization is necessary to prevent errors caused by overflow. Several normalization methods have been developed [20]. These include reset, variable shift, fixed shift, and modulo normalization [21]. In our proposed design, modulo normalization is adopted. The key idea of modulo normalization is not to avoid overflow but to accommodate it. It therefore rescales the path metrics locally and can be implemented with 2's complement adders. The penalty of modulo normalization is that one extra bit is required for all components in each ACS unit. Compared with the other normalization schemes, this overhead is quite small.
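A minimal Python sketch of the modulo comparison follows. The function names and the 11-bit width are assumptions chosen for illustration. The comparison is only valid while the true difference between the two metrics stays below 2^(bits-1) in magnitude, which is what the extra bit is meant to guarantee.

```python
def wrap(value, bits):
    """Reduce an integer to a bits-wide two's complement value."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def modulo_greater(metric_a, metric_b, bits):
    """Compare two wrapped path metrics by the sign of their wrapped
    difference, so overflow of the metrics themselves is harmless."""
    return wrap(metric_a - metric_b, bits) > 0


BITS = 11  # assumed width for this example
small = wrap(1020, BITS)  # still fits: stays 1020
large = wrap(1030, BITS)  # overflows: wraps to -1018
```

Here a plain comparison `large > small` gives the wrong answer because 1030 has wrapped to a negative number, whereas `modulo_greater(large, small, BITS)` still reports that the second metric is the larger one.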
The design of the ACS unit should be compatible with both the Max-Log-MAP algorithm and the Viterbi algorithm. Referring to (2.23) and (2.24), both the α and β recursions perform the same add-compare-select operations as the Viterbi decoder. The ACS architecture for the dual mode turbo/Viterbi decoder is shown in Fig. 5.7, which also gives the detailed bit-width information.
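The shared operation can be sketched as a single function. This is an illustration rather than the circuit of Fig. 5.7: the function name is hypothetical, and it assumes that both modes select the larger candidate, with the decision bit being what a Viterbi decoder would keep as survivor information.

```python
def acs(metric0, gamma0, metric1, gamma1):
    """Add-compare-select for one trellis state with two incoming branches.

    Returns the selected metric and a decision bit (0 or 1) that tells
    which branch won.
    """
    candidate0 = metric0 + gamma0  # add
    candidate1 = metric1 + gamma1
    decision = 1 if candidate1 > candidate0 else 0  # compare
    selected = candidate1 if decision else candidate0  # select
    return selected, decision
```

For example, `acs(5, 3, 6, 1)` keeps the first branch (8 against 7), while `acs(1, 1, 4, 2)` keeps the second (2 against 6).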
[Figure: ACS datapath with inputs α0, γ0, α1, and γ1; bit-width labels of 9 and 11 appear on the datapath]
Fig. 5.7 The ACS architecture for dual mode turbo/Viterbi decoder
5.2.5 LLR unit
The LLR unit is the function block responsible for computing the a posteriori LLR and the extrinsic information from the path metrics and branch metrics. The proposed architecture is shown in Fig. 5.8. By gathering the forward path metrics from the SRAM storing α, the backward path metrics from the ACS-β2 units, and the branch metrics from TMU-β2, the LLR for each branch is computed in 16 LLR-unit cells. After taking the maximum LLRs for both u_k = +1 and u_k = -1, the a posteriori LLR is obtained according to (2.22). The extrinsic information is then obtained based on (2.15). As discussed in section 4.1.2, the extrinsic data are bounded to the 4.2 format for cost reasons. Any value exceeding the range is saturated to +7.75 or -8.00 according to its sign bit. Finally, an additional last-in-first-out (LIFO) buffer is used for symbol re-ordering, since the LLRs are computed in backward order.
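The maximum-and-difference step and the 4.2 saturation can be sketched as follows. Both function names are hypothetical. The sketch assumes the common Max-Log-MAP form in which each branch value is already the sum of its α, γ, and β terms, and it assumes truncation toward minus infinity when quantizing to two fractional bits, since the rounding rule is not stated here.

```python
import math


def max_log_llr(plus_branches, minus_branches):
    """A posteriori LLR as the difference of the two per-hypothesis maxima."""
    return max(plus_branches) - max(minus_branches)


def saturate_4_2(value):
    """Clamp a value to the 4.2 fixed-point range [-8.00, +7.75].

    4.2 means 4 integer bits (sign included) and 2 fractional bits,
    i.e. a 6-bit two's complement code with a step of 0.25.
    """
    code = math.floor(value * 4)  # assumed truncation to a 0.25 step
    code = max(-32, min(31, code))  # 6-bit two's complement limits
    return code / 4
```

With these definitions, `saturate_4_2(9.5)` gives the upper bound 7.75 and `saturate_4_2(-12.0)` gives the lower bound -8.0, while in-range values such as 3.25 pass through unchanged.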
Fig. 5.8 The LLR unit architecture for turbo decoder
5.2.6 Interleaver design
The embedded interleaver/de-interleaver, which supports a maximum block length of 20,730, is designed to reduce the time required to permute symbols. Although an interleaver and a de-interleaver must co-exist in a turbo decoder, they can share memory because of the single MAP decoder design. As mentioned in section 5.2.1, in the first phase, for SISO decoder 1, the single MAP decoder reads the a priori information from the memory in sequential order. Once the extrinsic data are generated by the MAP decoder, they are written back to the memory in sequential order as well. Similarly, in the second phase, for SISO decoder 2, the MAP decoder reads the a priori information from the memory in permuted order, and the generated extrinsic data are written back in the same permuted order. The key point is that, since each location is read before it is written, there is no conflict while updating the data. This idea is illustrated in Fig. 5.9. To write and read data from memory at the same time, a traditional block interleaver architecture requires two memory blocks. However, this is a heavy burden for chip implementation because of the large block length. To write and read data at the same time without increasing the memory size, a time-multiplexed approach is used to provide two data ports. Therefore, the clock rate of the SRAM storing the extrinsic symbols must be twice that of the MAP decoder.
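The read-before-write argument can be checked with a small behavioral model. This is a sketch under assumptions: the function name, the per-symbol `siso` stand-in, and the fixed `latency` between a read and its matching write are all hypothetical. The model performs one read and one write per step, as the time-multiplexed memory would.

```python
from collections import deque


def half_iteration(memory, addresses, siso, latency=3):
    """Update the shared extrinsic memory in place for one phase.

    `addresses` is the read order (sequential or permuted). Each result
    is written back `latency` steps later to the address it was read from.
    """
    pending = deque()
    for step in range(len(addresses) + latency):
        if step < len(addresses):
            addr = addresses[step]
            pending.append((addr, siso(memory[addr])))  # read port
        if step >= latency:
            addr, extrinsic = pending.popleft()  # write port
            memory[addr] = extrinsic


memory = [10, 20, 30, 40]
half_iteration(memory, [0, 1, 2, 3], lambda x: x + 1)  # phase 1, sequential
half_iteration(memory, [2, 0, 3, 1], lambda x: x * 2)  # phase 2, permuted
```

Because every address appears exactly once in each phase and is always read before its own write, a single memory holds the data of both phases without a second buffer.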
[Figure: MAP decoder connected to the shared interleaver/de-interleaver memory through a read address generator and a write address generator]
Fig. 5.9 The architecture of shared memory design in turbo decoder
The permutation, realized by address management, operates on the fly with the MAP decoder and so adds no delay within each iteration. Fig. 5.10 shows the address generator for the interleaving operation. The on-the-fly address calculator eliminates a large address table of 310.95 kb. However, in the 3GPP2 standard, the generator may produce invalid addresses and stall the MAP decoder, which introduces redundant latency and thus decreases the throughput. This problem is solved by a duplicated address generator, since it is guaranteed that there is at least one valid address among any two successive permuted addresses. When an invalid address is observed, the address from the other generator is adopted. Two SRAMs of 20,730 words are included in the proposed decoder to store the systematic and extrinsic symbols, respectively. Both are single-port structures, which avoids the area overhead of multi-port memory.
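The stall-free behavior of the duplicated generator can be illustrated with a toy pruned interleaver. This is not the 3GPP2 address computation: a plain bit-reversed counter is swapped in because it has the same property that one of any two successive candidates is valid whenever the block length is at least half of the address range. All names are hypothetical.

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def pruned_addresses(block_len, width):
    """Emit one valid address per cycle using two candidate generators."""
    assert (1 << (width - 1)) <= block_len <= (1 << width)
    addresses = []
    counter = 0
    while len(addresses) < block_len:
        first = bit_reverse(counter, width)       # generator A
        second = bit_reverse(counter + 1, width)  # generator B, one ahead
        if first < block_len:
            addresses.append(first)
            counter += 1
        else:
            # An invalid candidate implies an odd counter, so the next
            # (even) counter reverses to a value below 2**(width - 1).
            addresses.append(second)
            counter += 2
    return addresses


# Size of the address table that on-the-fly generation avoids,
# assuming 15-bit entries (enough to address 20,730 words).
TABLE_BITS = 20730 * 15
```

Each loop pass models one decoder cycle and always yields an address, so the decoder never waits. The last line reproduces the 310.95 kb figure quoted above as 310,950 bits.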
Fig. 5.10 The address generator for the interleaver of 3GPP2 turbo decoder