CHAPTER 4 FIXED POINT ANALYSIS OF DUAL MODE TURBO/VITERBI
4.2 Fixed Point Analysis of Viterbi Decoder
A soft-decision Viterbi decoder provides better error-correction capability than a hard-decision one. Increasing the number of quantization levels reduces the error probability further, at the cost of linearly increased complexity. However, the improvement saturates once the quantization level reaches a threshold. A simulation evaluating the performance improvement for different quantization levels is shown in Fig. 4.7. All schemes use uniform quantization with the optimal step size. The BER curve with 128-level soft input is taken as the performance limit of code rate 1/2, 256-state Viterbi decoding. As we can see, moving from 8-level to 16-level soft input improves performance by up to 0.4dB, whereas the 32-level scheme gains only about 0.2dB over the 16-level scheme. Hence we conclude that 16-level soft decision yields a good trade-off between performance and complexity, and it is chosen for the proposed design.
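The uniform quantization used in this comparison can be sketched as follows. This is only an illustration under stated assumptions: the function name `uniform_quantize`, the mid-rise placement of the levels around zero, and the step size of 0.25 are hypothetical, since the optimal step size found in the simulation is not listed here.

```python
import math


def uniform_quantize(sample, levels=16, step=0.25):
    """Map a received sample to an integer soft-input level in 0 .. levels-1.

    Assumptions (not taken from the design): the levels are placed
    symmetrically around zero and step=0.25 is only a placeholder.
    """
    half = levels // 2
    index = math.floor(sample / step) + half
    # Samples outside the quantizer range saturate to the outermost levels.
    return max(0, min(levels - 1, index))


if __name__ == "__main__":
    for x in (-10.0, -0.1, 0.1, 10.0):
        print(x, uniform_quantize(x))
```

With 16 levels the soft input occupies 4 bits, which matches the soft-input range 0 ~ 15 listed in Table 4.2.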
[Figure: Performance of Viterbi decoder under different soft-input levels (QPSK, code rate = 1/2, tblen = 64); curves for 8, 16, 32, and 128 levels]
Fig. 4.7 The performance of Viterbi decoder with different quantization levels
Restricting the difference between the path metrics of any two states at any time is also a critical issue for Viterbi decoding. The Viterbi algorithm assumes that all survivor paths converge to the same node within the truncation length L. This assumption is illustrated in Fig. 4.8.
A principle for choosing the truncation length is introduced in section 3.3. Computing all path metrics from t=0 and using the notation of Fig. 4.8, we can get
$\Gamma_k(S_1) = \Gamma_{k-L} + \Gamma_1, \quad \Gamma_k(S_2) = \Gamma_{k-L} + \Gamma_2$
Then, the difference of any two path metrics can be written as
$\Gamma_k(S_1) - \Gamma_k(S_2) = \Gamma_1 - \Gamma_2 \le BL$ (4. 13)
where B denotes the maximum value of the branch metric.
[Figure: time axis with t=0, t=k-L, and t=k; labels Γ_{k-L}, Γ_1, Γ_2, Γ_k(S_1), Γ_k(S_2)]
Fig. 4.8 The convergence of any two survivor paths in Viterbi algorithm
In the 3GPP2 standard, the minimum code rate is 1/6. Combined with 16-level soft input (values 0 to 15), the maximum branch metric is B = 6 × 15 = 90. With the truncation length L = 64, the upper bound of the path metric difference in the proposed design is therefore BL = 5760, which means that at most 13 bits are required theoretically. However, this case rarely happens, and the bit-width of the path metric directly determines the size of its storage. A simulation of the cost of path metric storage was carried out and the result is shown in Fig. 4.9. It indicates that obvious performance degradation occurs when the 9-bit scheme is adopted. Therefore, we use 10 bits for path metric representation in the Viterbi decoder, and the corresponding storage requirement is at least 256 × 10 = 2560 bits. Finally, the system performance for each supported code rate is summarized in Fig. 4.10.
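The numbers in this paragraph can be reproduced with a few lines of arithmetic. The sketch below is only a check of the bound in (4. 13); the function name `path_metric_budget` and its parameter names are hypothetical, and the truncation length of 64 is taken from the simulation setting (tblen=64).

```python
def path_metric_budget(symbols_per_branch=6, soft_levels=16,
                       truncation_length=64, num_states=256, stored_bits=10):
    """Return (B, B*L, theoretical bit-width, storage in bits)."""
    max_branch = symbols_per_branch * (soft_levels - 1)   # B
    bound = max_branch * truncation_length                # B * L
    theoretical_bits = bound.bit_length()                 # bits needed to hold the bound
    storage_bits = num_states * stored_bits               # one metric per state
    return max_branch, bound, theoretical_bits, storage_bits


if __name__ == "__main__":
    print(path_metric_budget())
```

For the default arguments the function returns B = 90, BL = 5760, 13 theoretical bits, and 2560 bits of storage for the chosen 10-bit representation.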
[Figure: Performance of Viterbi decoder under different PM bit-widths (16-level soft input, QPSK, code rate = 1/6, tblen = 64); horizontal axis: SNR]
Fig. 4.9 The performance of Viterbi decoder under different bit-widths of path metric
[Figure: Overall performance of Viterbi decoder under different code rates (QPSK); curves for code rates 1/2, 1/3, 1/4, and 1/6]
Fig. 4.10 The performance analysis on system performance for each kind of code rate
4.3 Summary
In this chapter, fixed-point performance analysis of the internal variables has been carried out for both operating modes. The bit-width decisions for turbo mode and Viterbi mode are summarized in Table 4.1 and Table 4.2, respectively. It is verified that the design loss is only about 0.25dB for both turbo mode and Viterbi mode. A comparison with similar work [19] for turbo mode is given in Table 4.3; it shows that the results are nearly the same.
Table 4.1 Summary of bit-width decision for turbo mode
Variables | Lcrk,v | L(uk) | ∆γ | ∆α | ∆β | L(ûk) | Lex(uk)
Bounds | −∞ ~ +∞ | -8 ~ +8 | 40 | 96 | 96 | -136 ~ 136 | -124 ~ 124
Bit-width | 6 (3.3) | 6 (4.2) | 8 (6.2) | 8 (6.2) | 8 (6.2) | 9 (7.2) | 9 (7.2)
Table 4.2 Summary of bit-width decision for Viterbi mode
Variables soft input ∆BM ∆PM
Bounds 0 ~ 15 0 ~ 90 0 ~ 5760
Bit-width 4 7 10
Table 4.3 A comparison of bit-width decision with [19] for turbo mode
Variables | Lcrk,v | L(uk) | ∆γ | ∆α | ∆β | L(ûk) | Lex(uk)
our work | 6 (3.3) | 6 (4.2) | 8 (6.2) | 8 (6.2) | 8 (6.2) | 9 (7.2) | 9 (7.2)
[19] | 5 (3.2) | 6 (4.2) | 5* | 6* | 6* | 8* | 8*
* required bits for integer part only.
Chapter 5
Architecture of Proposed Dual Mode Turbo/Viterbi Decoder
5.1 Architecture of Integrated Turbo/Viterbi Decoder
Because both decoders share a trellis decoding structure, the combination takes advantage of resource sharing in the ACS and memory units, leading to a much more compact architecture for the 3GPP2 system. The proposed architecture of the integrated turbo/Viterbi decoder is shown in Fig. 5.1, where the shared components are drawn as gray blocks. A dedicated input switches the operating mode of the design. While turbo mode is active, all components used only in Viterbi mode are disabled by clock gating, and vice versa. This guarantees that redundant power consumption is avoided in both operating modes.
Depending on the operating mode, the input data goes through the input cache in turbo mode or through the transition metric unit (TMU, also called branch metric unit or simply BMU) of the Viterbi decoder (VD) in Viterbi mode. In turbo mode, the sliding window approach introduced in section 2.3 is adopted with a sub-block length of 20. The output data of the input cache then goes through three additional TMUs for data preparation.
The overall architecture consists of 24 ACS units, which are separated into 3 blocks to complete the α, β1, and β2 recursions in parallel in turbo mode. In Viterbi mode, only 16 of the 24 ACS units are used for trellis decoding. The path metrics in both algorithms are obtained by accumulating branch metrics. Finally, the output of the ACS units is passed either to the LLR computation unit for iterative decoding in turbo mode, or to the path metric unit (PMU) of the VD so that trace-back can be done periodically in Viterbi mode.
In Fig. 5.1, the memory occupies a significant area of our design. It includes the input cache, the forward path metric storage of the turbo decoder, and the interleaver/de-interleaver memory, which is shared with the survivor memory in Viterbi mode. To save chip area, time multiplexing is used to double the memory access frequency, so that all memory blocks except the input cache are implemented with single-port SRAM.
Fig. 5.1 The proposed architecture of integrated turbo/Viterbi decoder
5.2 Architecture of Turbo Decoder
The architecture of the integrated turbo/Viterbi decoder operating in turbo mode is shown in Fig. 5.2, in which all disabled blocks and unconnected lines are drawn as dotted lines.
Although iterative decoding with ten iterations provides 0.2dB coding gain over six iterations, the former scheme is not adopted in our design because of its longer output latency and higher power dissipation. The detailed operating flow is described as follows.
Fig. 5.2 The architecture of integrated turbo/Viterbi decoder in turbo mode
5.2.1 Single MAP Decoder design
In general, the block diagram of a turbo decoder can be drawn as in Fig. 2.4, which consists of two MAP decoders, two interleavers, and one de-interleaver. Implementing the turbo decoder directly from this diagram is too complicated and inefficient. Since the two constituent decoders are identical, a single MAP decoder is proposed, which both reduces design cost and simplifies the control logic of the two SISO decoders.
To achieve a single MAP decoder architecture, a full decoding iteration is split into two phases. In the first phase, acting as SISO decoder1, the MAP decoder reads the systematic data, the parity data, and the extrinsic values that come from the other decoder after de-interleaving. The output extrinsic data are stored in memory. In the second phase, acting as SISO decoder2, the MAP decoder processes the permuted systematic data, the parity data from the second encoder, and the a priori values, which are the interleaved extrinsic output of SISO decoder1. A simplified version of the architecture in Fig. 2.4 is illustrated in Fig. 5.3. Note that there is an additional input cache and only one memory block for extrinsic data storage. These are introduced later in sub-sections 5.2.2 and 5.2.6, respectively.
Fig. 5.3 A single MAP decoder architecture for turbo decoding
5.2.2 Cache design
As mentioned in the previous section, a sliding window approach is adopted in our design. Referring to Fig. 2.7, the data of each sub-block must be read three times, by the ACS-β1, ACS-α, and ACS-β2 units separately. Thus, an input cache is implemented to reduce repeated accesses to external memory, which also makes power-down possible. The cache keeps three consecutive sub-blocks (3 × 20 = 60 entries), and is equipped with one write port for data updating and three read ports for the ACS units. As shown in Fig. 5.4, the cache is implemented with a dual-port SRAM of size 60x24 bits, and uses time multiplexing to provide four data ports. A set of additional registers is employed at output port-2 to guarantee that all outputs of the cache are synchronized to the same rising clock edge. The detailed timing chart is shown in Fig. 5.5.
[Figure: dual-port 60×24 memory running at 2x clock rate, with input port-0 and output port-1 to port-3, addressed by Address-0 to Address-3]
Fig. 5.4 The input cache architecture
[Figure: timing of Port A and Port B, showing the reads of the data for computing β1, β2, and α]
Fig. 5.5 The detail timing chart of the proposed input cache
5.2.3 Transition Metric Unit (TMU)
In 3GPP2 standard, eight branch metrics are required for LLR computation. According to the formula listed in (2. 14), each branch metric is obtained by adding or subtracting received symbols and a priori data together depending on the branch codewords. Implementing it directly will consume lots of adders and subtractors. A simple method to overcome this problem is to use an equivalent formula listed in (5. 1)
[...] the difference of branch metrics. The modified architecture of TMU is shown in Fig. 5.6.
Fig. 5.6 The TMU architecture for turbo decoder
5.2.4 ACS Unit
The trellis diagrams of both the turbo code and the convolutional code can be decomposed into basic butterfly units, each of which can be implemented by ACS units. Because the branch metrics are accumulated, normalization is necessary to prevent errors due to overflow. Several normalization methods have been developed [20], including reset, variable shift, fixed shift, and modulo normalization [21]. In our proposed design, the last scheme is adopted. The key idea of modulo normalization is not to avoid overflow but to accommodate it. It therefore rescales the path metrics locally and can be implemented with 2's complement adders. The penalty of modulo normalization is that one extra bit is required for all components in each ACS unit. Compared with other normalization schemes, its overhead is quite small.
The design of the ACS unit must be compatible with both the Max-Log-MAP algorithm and the Viterbi algorithm. Referring to (2. 23) and (2. 24), we can easily see that both the α and β recursions perform the same add-compare-select operations as the Viterbi decoder. The ACS architecture for the dual mode turbo/Viterbi decoder is shown in Fig. 5.7, together with the detailed bit-width information.
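The behaviour of modulo normalization can be illustrated with a small software model. This is a sketch of the general technique and not the circuit of Fig. 5.7: the names `wrap` and `acs_modulo`, the default width of 11 bits, and the `select_max` switch (maximum for Max-Log-MAP, minimum for a distance-based Viterbi metric) are assumptions made for illustration. The comparison is only valid while the true difference between the two candidates stays below 2^(width-1), which is what the extra bit provides.

```python
def wrap(value, width):
    """Reduce value modulo 2**width and read it as a signed 2's complement number."""
    value &= (1 << width) - 1
    if value & (1 << (width - 1)):
        value -= 1 << width
    return value


def acs_modulo(pm0, bm0, pm1, bm1, width=11, select_max=True):
    """Add-compare-select on wrapped metrics; returns (new metric, decision bit)."""
    cand0 = wrap(pm0 + bm0, width)          # add: overflow is allowed to wrap
    cand1 = wrap(pm1 + bm1, width)
    diff = wrap(cand0 - cand1, width)       # compare: sign of the wrapped difference
    first_wins = diff >= 0 if select_max else diff <= 0
    return (cand0, 0) if first_wins else (cand1, 1)


if __name__ == "__main__":
    # 1000 + 30 overflows an 11-bit signed word, yet it is still seen as the larger one.
    print(acs_modulo(1000, 30, 1000, 10))
```

In the example, the first candidate wraps to a negative value but the sign of the wrapped difference still identifies it as the larger metric, so no explicit rescaling step is needed.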
[Figure: ACS datapath with inputs α0, γ0, α1, γ1 and annotated bit-widths]
Fig. 5.7 The ACS architecture for dual mode turbo/Viterbi decoder
5.2.5 LLR unit
The LLR unit is the functional block responsible for computing the a posteriori LLR and the extrinsic information from the path metrics and branch metrics. The proposed architecture is shown in Fig. 5.8. By gathering the forward path metrics from the SRAM storing α, the backward path metrics from the ACS-β2 units, and the branch metrics from TMU-β2, the LLR for each branch is computed in 16 LLR-unit cells. After taking the maximum LLRs for both uk=+1 and -1, the a posteriori LLR is obtained according to (2. 22). The extrinsic information is then obtained based on (2. 15). As decided in section 4.1.2, the extrinsic data is bounded to the 4.2 format for cost reasons. Any value exceeding the range is saturated to +7.75 or -8.00 according to its sign bit. Finally, an additional last-in-first-out (LIFO) buffer is used for symbol re-ordering, because the LLRs are computed in backward order.
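The saturation to the 4.2 format can be sketched as below. The function name `saturate_extrinsic` is hypothetical, and rounding to the nearest quarter is an assumption, since the rounding mode of the hardware is not stated here. Only the limits +7.75 and -8.00 come from the text.

```python
import math


def saturate_extrinsic(value, int_bits=4, frac_bits=2):
    """Quantize to a signed fixed-point format and clip to its range.

    With int_bits=4 and frac_bits=2 (the 4.2 format) the representable
    range is -8.00 .. +7.75 in steps of 0.25.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = math.floor(value * scale + 0.5)   # assumed rounding mode
    code = max(lowest, min(highest, code))   # saturate instead of wrapping
    return code / scale


if __name__ == "__main__":
    for x in (10.0, -12.3, 1.3):
        print(x, saturate_extrinsic(x))
```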
Fig. 5.8 The LLR unit architecture for turbo decoder
5.2.6 Interleaver design
The embedded interleaver/de-interleaver, which supports a maximum block length of 20,730, is designed to reduce the time required to permute symbols. Although the interleaver and de-interleaver must co-exist in a turbo decoder, they can share memory because of the single MAP decoder design. As mentioned in section 5.2.1, in the first phase (SISO decoder1) the single MAP decoder reads a priori information from the memory in sequential order, and the extrinsic data it generates is written back to the memory in sequential order as well. Similarly, in the second phase (SISO decoder2) the MAP decoder reads a priori information from the memory in permuted order, and the extrinsic data is written back in the same permuted order. The key point is that each location is read before it is overwritten, so no conflict occurs while updating data. This idea is illustrated in Fig. 5.9. In a traditional architecture, two memory blocks are required for the block interleaver in order to write and read data at the same time. However, this is a heavy burden for chip implementation because of the large block length. To write and read data at the same time without increasing the memory size, time multiplexing is used to provide two data ports. Therefore, the clock rate of the SRAM storing extrinsic symbols must be twice that of the MAP decoder.
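The in-place update described above can be modelled in software as follows. This is a behavioural sketch only: the names `half_iteration` and `full_iteration` are hypothetical, the SISO decoders are represented by arbitrary callables, and the pipeline delay between the read and the write-back in the real decoder is ignored.

```python
def half_iteration(memory, addresses, siso):
    """Update extrinsic values in place, reading each word before overwriting it."""
    for position, address in enumerate(addresses):
        a_priori = memory[address]                   # read first ...
        memory[address] = siso(position, a_priori)   # ... then write to the same word


def full_iteration(memory, permutation, siso1, siso2):
    """One turbo iteration on a single shared memory kept in natural order."""
    half_iteration(memory, range(len(memory)), siso1)   # phase 1: sequential addresses
    half_iteration(memory, permutation, siso2)          # phase 2: permuted addresses


if __name__ == "__main__":
    mem = [1, 2, 3, 4]
    full_iteration(mem, [2, 0, 3, 1],
                   lambda i, x: x + 1,        # stand-in for SISO decoder1
                   lambda i, x: 10 * x + i)   # stand-in for SISO decoder2
    print(mem)
```

Reading with the permuted address interleaves the data, and writing back to the same address de-interleaves the result, so the memory content is in natural order again at the start of the next iteration.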
[Figure: MAP decoder connected to the shared memory for interleaver/de-interleaver through a read address generator and a write address generator]
Fig. 5.9 The architecture of shared memory design in turbo decoder
The permutation is realized by address management and operates on the fly with the MAP decoder, so it adds no delay within each iteration. Fig. 5.10 shows the address generator for the interleaving operation. The on-the-fly address calculator eliminates a large address table of 310.95kb (20,730 addresses of 15 bits each). However, in the 3GPP2 standard the generator may produce invalid addresses and stall the MAP decoder, which introduces extra latency and thus decreases the throughput. This problem is solved by a duplicated address generator, since it is guaranteed that at least one of any two successive permuted addresses is valid. When an invalid address is observed, the address from the other generator is used instead. Two SRAMs of 20,730 words are included in the proposed decoder to store the systematic and extrinsic symbols, respectively. Both are single-port memories, which avoids the area overhead of multi-port memory.
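The idea of the duplicated address generator can be illustrated with a toy model. The sketch below does not implement the 3GPP2 interleaver of Fig. 5.10. It uses a plain pruned bit-reversal permutation as a stand-in, because it shares the relevant property: the least significant counter bit becomes the most significant address bit, so of any two successive counter values at least one yields a valid address. The names `bit_reverse` and `pruned_addresses` are hypothetical.

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def pruned_addresses(block_len, width):
    """Yield one valid address per step without stalling.

    Two generators evaluate counter and counter+1 in the same step; if the
    first address is out of range, the second one is used instead.
    """
    assert (1 << (width - 1)) < block_len <= (1 << width)
    counter = 0
    for _ in range(block_len):
        first = bit_reverse(counter, width)
        second = bit_reverse(counter + 1, width)   # duplicated generator
        if first < block_len:
            yield first
            counter += 1
        else:
            yield second                           # valid by the property above
            counter += 2


if __name__ == "__main__":
    print(list(pruned_addresses(5, 3)))
```

Every step produces exactly one valid address, so the consumer of the addresses never has to wait for an invalid one to be skipped.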
Fig. 5.10 The address generator for the interleaver of 3GPP2 turbo decoder
5.3 Architecture of Viterbi Decoder
In Viterbi mode, 256-state trellis decoding is implemented for code rates 1/2, 1/3, 1/4, and 1/6. The architecture of the Viterbi decoder reuses the hardware components of the turbo decoder. Since the maximum throughput specified in the 3GPP2 standard is not demanding, a fully parallel architecture is not necessary here. Fig. 5.11 shows the architecture of the integrated turbo/Viterbi decoder operating in Viterbi mode. The 16 ACS units in ACS-α and ACS-β1 (16 of the 24 units) are employed to finish the 256 ACS operations within 16 cycles. The interleaver memory of the turbo decoder serves as the survivor memory. The detailed operating flow is described as follows.
Fig. 5.11 The architecture of integrated turbo/Viterbi decoder in Viterbi mode
5.3.1 Transition Metric Unit (TMU)
Because the number of ACS units is limited, the ACS operations for each time index in the trellis have to be spread over 16 cycles. Thus, each ACS unit sees different branch codewords depending on the cycle count. A TMU cell is designed for Viterbi decoding to deal with this problem; its architecture is shown in Fig. 5.12. The TMU controller outputs the appropriate codewords based on the current code rate and cycle count. The transition metrics are then obtained by accumulating the differences between the soft-input symbols and the codewords. The TMU contains 32 TMU cells, which serve the 16 ACS units and complete all branch metric computations.
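The accumulation performed by one TMU cell can be sketched as follows. The function name `branch_metric` is hypothetical, and the mapping of codeword bit 0 to soft level 0 and bit 1 to the top soft level is an assumption made for illustration; it is chosen so that the result stays within the 0 ~ 90 range listed for ∆BM in Table 4.2.

```python
def branch_metric(soft_symbols, codeword_bits, soft_levels=16):
    """Sum of absolute differences between soft inputs and the expected levels.

    Assumption: a codeword bit of 0 corresponds to soft level 0 and a bit
    of 1 to the top level (soft_levels - 1).
    """
    top = soft_levels - 1
    return sum(abs(symbol - bit * top)
               for symbol, bit in zip(soft_symbols, codeword_bits))


if __name__ == "__main__":
    # Rate 1/6: six soft symbols per branch, worst case is 6 * 15 = 90.
    print(branch_metric([0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]))
```

For rates 1/2, 1/3, and 1/4 the same function applies with 2, 3, or 4 symbols per branch, which is why the codewords supplied to a cell depend on the code rate.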
Fig. 5.12 The architecture of TMU cell
5.3.2 Survivor Memory Management
Several survivor memory management methods were introduced in [22]. Among them, a modified 3-pointer even algorithm is employed to achieve high-speed trace-back. With this method, the survivor memory must be three times that of the traditional scheme. The memory is divided into six banks, each of length L/2, where L is the truncation length. The basic idea is illustrated in Fig. 5.13. In this graph, one WRITE (WR), two TRACE-BACK (TB), and one DECODE (DC) operations proceed in parallel. They are described in detail as follows.
WRITE (WR)
The 16 decision bits made by the ACS units are written into the survivor memory. In each cycle, the WRITE pointer moves forward to avoid data conflicts. To cover the decision bits of all 256 states, the WRITE operation is also divided into 16 cycles.
TRACE-BACK (TB1 and TB2)
The TRACE-BACK operation starts from the time index t=3L/2. Since each memory bank has a length of L/2, two TRACE-BACK operations must be performed in two memory banks to trace back over the full truncation length L before decoding. In this step, the pointer moves backward. The decision bit, Dt, is chosen from the 256 survivor paths according to the method introduced in section 3.3. The traced-back state St-1 is obtained by St-1 = (St << 1) | Dt, where << denotes the left shift operation and | denotes the OR operation.
DECODE (DC)
An additional DECODE operation finishes Viterbi decoding after the TRACE-BACK in the previous bank is complete. The DECODE operation works exactly like the TRACE-BACK operation. Because decoding proceeds backward, the decoded sequence is in reverse order, and thus an additional LIFO buffer of length L/2 is required for bit re-ordering.
Based on the 3-pointer even algorithm, the modified architecture takes 16 cycles to WRITE, 2 cycles to TRACE-BACK, and 1 cycle to DECODE, and thus 19 cycles in total to realize Viterbi decoding for each time section in the trellis diagram. For clarity, the architecture of survivor memory management is shown in Fig. 5.14.
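The state update used by TRACE-BACK and DECODE, together with the LIFO re-ordering, can be sketched in software. This model ignores the six-bank organization and the parallel pointers. The function name `trace_back` is hypothetical, and two conventions are assumptions: the shifted state is masked to the state width (8 bits for 256 states), and the decoded bit is taken from the most significant bit of the current state, which corresponds to an encoder whose input bit enters the state register at the top.

```python
def trace_back(decisions, start_state, state_bits=8):
    """Walk backward through the decision bits and return bits in forward order.

    decisions[t][s] is the decision bit D_t stored for state s at time t.
    """
    mask = (1 << state_bits) - 1
    state = start_state
    lifo = []
    for column in reversed(decisions):
        lifo.append(state >> (state_bits - 1))       # assumed: decoded bit is the MSB
        state = ((state << 1) | column[state]) & mask  # S(t-1) = (S(t) << 1) | D(t)
    return lifo[::-1]                                 # LIFO restores forward order


if __name__ == "__main__":
    # Build decision bits for a known bit sequence on a small 3-bit trellis.
    bits, n, state, columns = [1, 0, 1, 1, 0], 3, 0, []
    for u in bits:
        nxt = (state >> 1) | (u << (n - 1))
        column = [0] * (1 << n)
        column[nxt] = state & 1        # bit shifted out of the previous state
        columns.append(column)
        state = nxt
    print(trace_back(columns, state, state_bits=n))
```

The example builds the decision bits for a known input sequence and recovers that sequence by tracing back from the final state.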
Fig. 5.13 The 3-pointer even algorithm for survivor memory management
Fig. 5.14 The architecture of survivor memory management
5.4 Summary
In this chapter, we have presented the architecture of our proposed design and introduced the components used in each mode. In turbo mode, the kernel of the MAP decoder, including the ACS units, can operate at a lower clock rate because of the two-phase clock design. A dual-port input cache is embedded to reduce the number of external memory accesses. Both features further reduce power consumption in turbo mode. Moreover, the efficient interleaver design removes the redundant memory block otherwise needed for the interleaver/de-interleaver. In Viterbi mode, a modified 3-pointer even algorithm is used to increase the throughput. Based on the 16 shared ACS units, one decoded bit is obtained every 19 cycles.
Chapter 6
Chip Implementation
6.1 Chip specification
The decoder is implemented with a cell-based design flow and fabricated in a 0.18 µm 1P6M standard CMOS process. In turbo decoding mode, two clock domains are used, for the memory and the datapath respectively. The lower clock is derived from the input clock by clock gating. The doubled clock rate gives the memory higher bandwidth, so single-port memory is sufficient in the proposed design except for the cache memory. The chip size is 11.56 mm² with a core size of 7.29 mm². The total gate count is about 115k gates, including the path metric memory of the Viterbi decoder. Three single-port SRAMs and one dual-port SRAM are embedded in the chip, with a total size of 251.6kb. The maximum IR drop that