L0, L1 unit - Hardware architecture - A Computationally Efficient and Memory-Saving Turbo Decoder for Third-Generation Mobile Communications

6.1 Hardware architecture

6.1.4 L0,L1 unit

Figure 6.7: (a) BetaChange hardware (b) LineChange hardware

The equation of the a-posteriori LLR $L(u_k|y)$ is rewritten below:

$$L(u_k|y) = \max_{(s',s) \Rightarrow u_k=1}\left(A_{k-1}(s') + \Gamma_k(s',s) + B_k(s)\right) - \max_{(s',s) \Rightarrow u_k=0}\left(A_{k-1}(s') + \Gamma_k(s',s) + B_k(s)\right) = L_1 - L_0$$

where $(s', s)$ denotes a state transition from state $s'$ to state $s$.

Figure 6.8: Nesting max operations for L1 hardware

When $L_1$ and $L_0$ have been calculated, we can compute $L(u_k|y)$. If $L(u_k|y)$ is greater than or equal to zero, the decoded bit is “1”. If $L(u_k|y)$ is smaller than zero, the decoded bit is “0”.

6.1.5 Complete turbo decoder architecture

The block diagram of the entire turbo decoder hardware is shown in Figure 6.9.

Figure 6.9: The block diagram of total turbo decoder hardware

The solid-line rectangles are computational or control units, and the dotted-line rectangles are memories. The functions of the blocks are as follows:

1. Cntl unit: The Cntl unit controls where and when each memory is read or written. We pack the interleaver/deinterleaver memory into it because its function is to provide the addresses for mapping the normal-order systematic bits $y_{ks}$ to the interleaved-order systematic bits $y'_{ks}$, and the addresses for writing the extrinsic LLRs.

2. ys,yp,y’p memories: These memories store the received systematic bits, the parity bits corresponding to the normal-order information bits, and the parity bits corresponding to the interleaved-order information bits, respectively. The interleaved systematic bits are not stored separately; they are read from the ys memory through the interleaver addresses provided by the Cntl unit. Their depths all equal $L_{HW}$.

3. MemExtIL, MemExt memories: These two memories provide the a-priori LLRs and store the extrinsic information. In the first half iteration, the decoder deals with the natural-order received bits ys,yp; MemExt provides the a-priori LLRs and MemExtIL receives the extrinsic information produced in this half iteration. In the second half iteration, the decoder deals with the interleaved-order received bits y’s,y’p; MemExtIL now provides the a-priori LLRs and MemExt receives the extrinsic information produced in this half iteration.

4. MemA memory: MemA stores the backward state metrics during the backward calculation and provides them to the L1, L0 processors in the computational core while the decoder calculates the forward state metrics.

5. MemB: MemB is the initialization memory required by the halfway method. It stores the backward state metrics periodically and provides them as the initialization state metrics for the corresponding backward calculations at the next iteration.

6.2 Memory requirements

In the previous section, we listed all the memories needed in the decoder hardware. Together with the specifications decided in section 5.3 and section 5.4, we can calculate all the memory capacities. The numbers are collected in Table 6.1.

| Function | Name in Figure 6.9 | Bit width | Memory capacity (bits) (depth*width*banks) | Memory capacity (bytes) |
|---|---|---|---|---|
| Input memory_b | ys,yp,y'p | 7 | 5120*7*3=107520 | 13440 |
| Input memory_d | ys,yp,y'p | 7 | 5120*7*3=107520 | 13440 |
| Initialization memory | MemB | 9 | 64*8*9=4608 | 576 |
| Backward metrics memory | MemA | 9 | 256*8*9=18432 | 2304 |
| Interleaver/deinterleaver | ILMem/DIMem | 13 | 5114*13*2=132964 | 16620.5 |
| Extrinsic info. memory | MemExt/MemExtIL | 7 | 5114*7*2=71596 | 8949.5 |
| Total | | | 442640 | 55330 |
| Total (excluding input memory_b) | | | 335120 | 41890 |

Table 6.1: memory capacity
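The totals in Table 6.1 can be reproduced with a few lines of arithmetic. The Python sketch below copies the three factors of each row in the order the table lists them; the dictionary keys and variable names are illustrative only.

```python
# Three capacity factors per memory, copied from Table 6.1
MEMORIES = {
    "input memory_b": (5120, 7, 3),
    "input memory_d": (5120, 7, 3),
    "MemB": (64, 8, 9),
    "MemA": (256, 8, 9),
    "ILMem/DIMem": (5114, 13, 2),
    "MemExt/MemExtIL": (5114, 7, 2),
}

# Capacity in bits is the product of the three factors
bits = {name: a * b * c for name, (a, b, c) in MEMORIES.items()}

total_bits = sum(bits.values())
total_without_buffer = total_bits - bits["input memory_b"]

for name, b in bits.items():
    print(f"{name:16s} {b:7d} bits {b / 8:9.1f} bytes")
print(f"total            {total_bits:7d} bits {total_bits / 8:9.1f} bytes")
print(f"total w/o buffer {total_without_buffer:7d} bits "
      f"{total_without_buffer / 8:9.1f} bytes")
```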

The first column is the memory name according to its function, and the second column is the corresponding name in Figure 6.9. In the rows for input memory_b and input memory_d, there are two sets of memories for $y_s$, $y_p$, $y'_p$. The first, input memory_b, is used as a buffer that stores the data for the next block while the data in memory_d are being decoded. When the data in memory_d have been decoded, the two memories exchange roles. The data buffer (memory_b) is usually not counted as part of the decoder hardware, because it does not supply data to the decoder during the decoding process, although it is essential. Therefore we also list the memory capacity without the buffer in the last row for reference.

6.3 Decoding Process

Before the decoding process starts, we need to initialize the interleaver and de-interleaver memories. Because generating the interleaving sequence requires many multiplications, divisions and table look-ups, we do not implement it in hardware. Instead, we use software to calculate the interleaving sequence and load it into the interleaver and de-interleaver memories before decoding starts. The interleaving sequence is written to the interleaver memory first. At the same time, each data value written to the interleaver memory is used as the write address of the de-interleaver memory, and the corresponding write address of the interleaver memory is used as the data written to the de-interleaver memory. After the interleaver/de-interleaver memories have been initialized, the decoding process begins.
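The initialization rule described above amounts to building the inverse permutation while the interleaving sequence is being written. A minimal Python sketch, assuming the sequence is supplied as a list of 0-based addresses; the function name and the 4-entry toy sequence are illustrative and are not a 3GPP interleaver:

```python
def init_interleaver_memories(il_seq):
    """Fill the interleaver and de-interleaver memories in a single pass."""
    n = len(il_seq)
    il_mem = [0] * n
    di_mem = [0] * n
    for write_addr, data in enumerate(il_seq):
        il_mem[write_addr] = data   # interleaver: address -> sequence value
        di_mem[data] = write_addr   # de-interleaver: data and address swapped
    return il_mem, di_mem


# Toy permutation (illustration only)
il_mem, di_mem = init_interleaver_memories([2, 0, 3, 1])
print(il_mem, di_mem)
```

Reading the de-interleaver memory at an address taken from the interleaver memory returns the original address, which is the property the decoder relies on when it writes extrinsic information back in the other order.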

Assume that the received data comprising one frame of information are in input memory_d, with frame size $= N$, block length $= L_{HW}$, and block number $p = \lceil N / L_{HW} \rceil$. If $N \le L_{HW}$, this decoder works as the normal Max-Log-MAP decoder does. In order to explain the halfway decoding process, we further assume $N = p \times L_{HW}$ where $p \in \mathbb{N}$; then we can divide one frame into block-1, block-2, ..., block-p, denoted as $sb_i$ where $i \in \{1, 2, \ldots, p\}$. Because the 3GPP encoder has trellis termination, the encoded code has 12 tail bits corresponding to 6 trellis stages, as stated in section 4.2. Hence the decoder first calculates the backward state metrics from these tail bits so that the subsequent processing is regular. When the tail bits have been processed, the resulting backward state metrics are stored to MemB as the initialization state metrics for $sb_p$.

Now the first half iteration begins. We use $y_s$, $y_p$ to calculate the backward state metrics first for $sb_i$, from $s_{i \times L_{HW} - 1}$ to $s_{(i-1) L_{HW}}$, where $i = 1, 2, \ldots, p$. The initialization state metrics for each block are all set to zero at this iteration, except for the last block. The state metrics which are input to the ACSP are saved to MemA, and the last calculated state metrics are stored to MemB at address $(63 - i + p)$ when $i \neq 1$. Afterwards the forward state metrics are calculated for $sb_i$ from $s_{(i-1) L_{HW}}$ to $s_{i \times L_{HW} - 1}$, where $i = 1, 2, \ldots, p$. The state metrics which are input to the ACSP are sent to the L0, L1 units together with the corresponding backward state metrics stored in MemA. The extrinsic information can then be computed and saved to MemExtIL according to the corresponding interleaving address. The decoding process in the second half iteration is similar to that of the first half iteration, with the following differences: $y'_s$, $y'_p$ are used to calculate the backward/forward state metrics; the last calculated state metrics of each block are stored to MemB at address $(31 - i + p)$ when $i \neq 1$; and the computed extrinsic information is saved to MemExt according to the corresponding de-interleaving address.

When the first iteration completes, MemB holds the initialization state metrics for the backward state metric calculation of $sb_i$, where $i = 1, 2, \ldots, p$. The graphic representation of the halfway SISO algorithm is shown in Figure 6.10.
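The per-block order of operations in one half iteration can be sketched as a schedule of trellis index ranges. The Python sketch below only enumerates the ranges described above, under the same assumption that $N$ is an exact multiple of $L_{HW}$; the function name, the dictionary fields, and the example values N = 1024 and block length 256 are illustrative assumptions, and the MemB address calculation is deliberately left out.

```python
def halfway_schedule(frame_size, block_len):
    """List the backward/forward index ranges of each block sb_i."""
    if frame_size % block_len != 0:
        raise ValueError("this sketch assumes N = p * L_HW")
    p = frame_size // block_len
    schedule = []
    for i in range(1, p + 1):
        first = (i - 1) * block_len
        last = i * block_len - 1
        schedule.append({
            "block": i,
            "backward": (last, first),  # backward metrics run from last to first
            "forward": (first, last),   # forward metrics run from first to last
            "save_to_memb": i != 1,     # block 1 has no preceding block to initialize
        })
    return schedule


schedule = halfway_schedule(frame_size=1024, block_len=256)
for step in schedule:
    print(step)
```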

Figure 6.10: graphic representation of the halfway SISO algorithm

Although we assume $N = p \times L_{HW}$ where $p \in \mathbb{N}$ in the decoding process discussed above, $N$ is not a multiple of $L_{HW}$ in general. The derivations can easily be modified to apply to the general case.

_________________________________________

Chapter 7 Hardware implementation

_________________________________________

With modern VLSI technology, we can design hardware with a high clock rate and complicated functions. There are two design approaches: bottom-up and top-down. By using abstraction, the designer can collapse details and arrive at a simpler concept with which to deal.

In the design process of integrated circuits, the layout techniques are very mature, so we can use computer-aided design (CAD) tools to help us place and route.

Nowadays most digital communication integrated circuits adopt standard cell design instead of full custom design. Therefore the emphasis is put on the algorithms and the hardware architectures. In this thesis we also adopt standard cells to design the hardware.

7.1 Design and verify process

First, we write a C program to simulate the decoding algorithm so that we can understand the flow of the decoding process. We verify the C program by examining a large amount of data.

Second, we plan the hardware architecture. In this thesis, we implement the decoder with the halfway memory-saving method. Thus we can meet the 3GPP requirement using only one ACS processor, without a high operating frequency. Then we develop a bit-accurate C model according to this architecture. Because we use a fixed-point implementation, we can analyze the word lengths of the quantities with this bit-accurate C model. It is easier to modify the word lengths and the architecture in C code than in HDL code. If we find that the specification cannot satisfy our objective, we can redesign the architecture or change the word lengths easily and quickly.

Besides, the C model makes HDL debugging easier.

Third, we proceed to RTL verification. When the RTL code functions correctly, we synthesize it with synthesis tools. If the synthesis result does not satisfy our requirement, we need to modify the architecture and repeat the flow from the bit-accurate C model.

Finally, if the synthesis result meets the requirement, we download the RTL code to the FPGA development board. Afterward we verify the hardware circuit by inputting a large amount of data.

In summary, our development and design flow is shown in Figure 7.1.

Figure 7.1: development and design flow

7.2 Hardware specification

In this section, we first describe the clock cycles for decoding one frame of data. Then we define the hardware input and output ports.

7.2.1 Clock cycles for decoding one data frame

The clock cycles for decoding one data frame are dependent on the frame size.

Our hardware is pipelined into five stages. Thus the internal latency is 5 clock cycles.

The total required clock cycles for decoding one data frame are calculated as follows:

$$\text{clock cycles}_{N,Iter} = 2 \times (2 \times \text{frame\_size} + \text{internal\_delay}) \times Iter + 6 \qquad (35)$$

The subscript “N” of clock cycles stands for the frame size of the data, for simplicity. The first term in the inner parentheses, “frame_size”, is also the frame size of the data. The term “Iter” is the number of complete decoding iterations. The last term, “6”, is the number of clock cycles for calculating the tail bits.

Since the frame size of 3GPP turbo code ranges from 40 to 5114, we list some examples as follows:

| Frame size | 40 | 500 | 1024 | 5114 |
|---|---|---|---|---|
| Clock cycles (Iteration = 5) | 856 | 10056 | 20536 | 102336 |
| Clock cycles (Iteration = 10) | 1706 | 20106 | 41066 | 204666 |

Table 7.1: decoding clock cycles for different frame size

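When the frame size is small, the internal delay takes up a noticeable share of the decoding cycles: for N = 40 it contributes 5 of the 85 cycles in each half iteration (about 6%), whereas for N = 5114 it contributes 5 of 10233 cycles (about 0.05%).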
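Equation (35) is easy to evaluate in software. The Python sketch below uses the internal delay of 5 cycles and the 6 tail-bit cycles stated in this section; the function and constant names are illustrative.

```python
INTERNAL_DELAY = 5  # pipeline latency in clock cycles (five pipeline stages)
TAIL_CYCLES = 6     # clock cycles for calculating the tail bits


def decoding_clock_cycles(frame_size, iterations):
    """Clock cycles for decoding one frame, following equation (35)."""
    return 2 * (2 * frame_size + INTERNAL_DELAY) * iterations + TAIL_CYCLES


for iterations in (5, 10):
    for frame_size in (40, 5114):
        cycles = decoding_clock_cycles(frame_size, iterations)
        print(f"N={frame_size:4d} Iter={iterations:2d} -> {cycles} cycles")
```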

7.2.2 Hardware interface

For convenience, we pack the decoder as a processing core and list its input/output ports in Table 7.2. When this processing core is used, we only need to configure the pins appropriately. The I/O diagram of this processing core is shown in Figure 7.2.

Figure 7.2: Turbo decoder I/O diagram

| Port | i/o | bit width | description |
|---|---|---|---|
| clk | input | 1 | system clock |
| reset | input | 1 | reset the register contents |
| FS | input | 13 | configure the frame size of data |
| d_in | input | 7 | received data input |
| IL_seq | input | 13 | interleave sequence input |
| Iteration | input | 5 | configure the iteration number |
| Valid | output | 1 | indicate the decode bit valid |
| Decode_bit | output | 1 | decode bit output |
| Complete | output | 1 | indicate finish decoding one block of data |

Table 7.2: I/O ports definition

7.3 ASIC performance

We are interested in the gate count of the turbo decoder hardware, so we divide the turbo decoder into two parts: the memory part, and the control and computation part. The ASIC verification flow is shown in Figure 7.3. We use MATLAB to generate the encoded sequence and the additive white Gaussian noise, and write this information into the test bench. We then compare the hardware decoding results with the bits decoded by the bit-accurate C decoding program. If “Out_cp” outputs “1”, the two results differ, which indicates an error in the decoder hardware.

The ASIC simulation environment is as follows:

HDL: Verilog

Compiler tool: Verilog-XL

Debug tool: Debussy

Synthesis tool: Synopsys

Process: TSMC 0.25 μm

The simulation results are listed in Table 7.3. The maximum clock rate for this decoder is 102.56MHz.

Figure 7.3: ASIC verification flow

| Constraint | 9.75 ns | 10 ns | 12.5 ns | 25 ns |
|---|---|---|---|---|
| Clock rate | 102.56 MHz | 100 MHz | 80 MHz | 40 MHz |
| Gate counts | 28.7k | 28.1k | 24.8k | 15.1k |

Table 7.3: ASIC simulation results

Together with equation (35) in section 7.2.1, we can calculate the clock rate required for decoding the data. Assuming the required output data rate is $R_d$, the frame size is $N$ and the iteration number is $Iter$, we get:

$$\text{required clock rate}_{N,Iter} = \frac{R_d}{N} \times \text{clock cycles}_{N,Iter}$$

In 3GPP, the maximum $R_d$ is 2 Mbps; thus the required clock rate (in MHz) is:

| clock rate | Iter=5 | Iter=10 |
|---|---|---|
| N=40 | 42.8 | 85.3 |
| N=5114 | 40.02 | 80.04 |

Table 7.4: required clock rate for decoding different frame size and iteration

Because our hardware has a maximum operating frequency of 102.56 MHz, it can meet the 3GPP requirement.
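The required clock rate follows directly from the cycle count per frame and the output data rate. A minimal Python sketch under the values given in the text (2 Mbps maximum data rate, internal delay of 5 cycles, 6 tail-bit cycles, 102.56 MHz maximum ASIC clock); the function and constant names are illustrative.

```python
MAX_ASIC_CLOCK_MHZ = 102.56  # maximum clock rate reported for the ASIC


def required_clock_rate_mhz(frame_size, iterations, rd_mbps=2.0):
    """Clock rate (MHz) needed to sustain rd_mbps of decoded output."""
    # Cycles per frame: two half iterations, each 2*N + 5 cycles, plus 6 tail cycles
    cycles = 2 * (2 * frame_size + 5) * iterations + 6
    # One frame delivers frame_size bits, so scale the cycles per bit by the data rate
    return rd_mbps * cycles / frame_size


for frame_size in (40, 5114):
    for iterations in (5, 10):
        rate = required_clock_rate_mhz(frame_size, iterations)
        ok = rate <= MAX_ASIC_CLOCK_MHZ
        print(f"N={frame_size:4d} Iter={iterations:2d}: {rate:6.2f} MHz, "
              f"within ASIC limit: {ok}")
```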

7.4 FPGA verification

We use MATLAB to generate the encoded sequence and the additive white Gaussian noise. We use the bit-accurate C decoder to decode the received sequence and write the decoding results into a file. Then we put the received information into the ROM of the turbo decoder and compare its decoding results with those generated by the bit-accurate C decoder. The output bits and the comparison results are shown on the seven-segment display. The FPGA verification flow is shown in Figure 7.4.

The simulation environment is as follows:

FPGA development board: Altera Stratix II EP1S25780C5

Simulation software: Quartus II 4.0

HDL: Verilog

Max. clock rate = 40.2 MHz

Figure 7.4: FPGA verification flow

_____________________________________________________________________

Chapter 8 Conclusion and Future works

_________________________________________

8.1 Conclusion

In this thesis, we implement an efficient and memory-saving 3GPP turbo decoder which uses the halfway method. This decoder is based on the Max-Log-MAP algorithm and uses only one ACS processor. The halfway method decreases the memory capacity, which is the critical design problem for turbo decoders. It also eliminates the redundant initialization calculations that other decoding methods require. As a result, using only one ACS processor in our decoder does not slow down the decoding speed. Furthermore, using the halfway memory-saving method in the decoder can decrease the decoding latency. Through computer simulation and analysis, we decide the fixed-point representations and the block length for the halfway method in order to obtain a cost-effective turbo decoder. We compare the BER performance of the halfway method with the commonly used sliding window schemes and confirm that our approach does not sacrifice any performance.

8.2 Future works

Our hardware design can still be improved in three aspects:

1. Decoding speed: Although our decoder hardware can satisfy the maximum decoding speed of the 3GPP specification, 2 Mbits/s, with 5 decoding iterations at a 40.2 MHz operating frequency, the need for more iterations and faster decoding will still exist in the future. Therefore we can use one more ACS processor to calculate the forward state metrics while the original ACS processor calculates the backward state metrics. This will boost the decoding speed at the cost of a small hardware overhead.

2. Stopping criterion: We do not implement any stopping criterion in our decoder; thus the decoder always decodes for a fixed number of iterations. This consumes energy unnecessarily and wastes decoding time.

3. Embedded interleaver/de-interleaver generator: At the moment we assume the interleaver/de-interleaver data are stored in memory, which costs a lot of memory. If we can design hardware that generates the interleaving/de-interleaving sequence on the fly when needed, it will significantly decrease the memory capacity needed by the decoder. More exactly, the saving is 2 × 13 × 5114 = 132964 bits ≈ 133.0 Kbits = 16.6205 Kbytes.

