
Chapter 4 Simulation Results and Analysis

4.2 Fixed-Point Simulations

In this section, we further analyze the finite-word-length performance of the LDPC decoder and discuss the possible tradeoff between hardware complexity and decoding performance. Let [t:f] denote the quantization scheme in which a total of t bits are used, of which f bits represent the fractional part of the values.

Various quantization configurations, such as [6:3], [7:3], and [8:4], are investigated here.
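To make the [t:f] notation concrete, the following Python sketch quantizes a floating-point value to t total bits with f fractional bits; the function name, rounding, and saturation behavior are our assumptions, since the thesis does not specify the exact rounding scheme.

```python
def quantize(x, t=7, f=3):
    """Quantize x to the [t:f] fixed-point format: t total bits,
    f fractional bits, one sign bit, saturating on overflow."""
    step = 2.0 ** (-f)                    # resolution of the fractional part
    max_val = (2 ** (t - 1) - 1) * step   # largest representable magnitude
    y = round(x / step) * step            # round to the nearest step
    return max(-max_val, min(max_val, y)) # saturate to the representable range

# Example: with [7:3], values are multiples of 0.125 in [-7.875, 7.875].
print(quantize(3.14159, t=7, f=3))   # -> 3.125
print(quantize(-12.0,  t=7, f=3))    # -> -7.875 (saturated)
```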

Figure 4.8 Fixed-point BER simulations (BER vs. Eb/No) of three quantization configurations ([6:3], [7:3], [8:4]) of the min-sum decoding algorithm, compared with floating-point min-sum, in AWGN channel, code length=576, code rate=1/2, maximum iteration=10.

Figure 4.9 Floating-point vs. fixed-point ([7:3]) BER simulations of the normalized min-sum (β=0.75) and dynamic normalized-offset min-sum algorithms.

Chapter 5

Architecture Designs of LDPC Code Decoders

In this chapter, we introduce the hardware architecture of the LDPC code decoder in our design and discuss the implementation of an irregular LDPC decoder for the 802.16e standard. The decoder has a code rate of 1/2 and a code length of 576 bits. The parity-check matrix of this code is listed in Appendix A.

5.1 The Whole Decoder Architecture

The parity-check matrix H in our design is in block-LDPC form, as discussed in Section 2.2. The parity-check matrix is composed of m_b × n_b sub-matrices, each of which is either a zero matrix or a permutation matrix of size z × z. The permutations used are circular right shifts, so the set of permutation matrices contains the z × z identity matrix and circularly right-shifted versions of the identity matrix.

$$
\mathbf{H} =
\begin{bmatrix}
P_{0,0} & P_{0,1} & \cdots & P_{0,n_b-1} \\
P_{1,0} & P_{1,1} & \cdots & P_{1,n_b-1} \\
\vdots & \vdots & \ddots & \vdots \\
P_{m_b-1,0} & P_{m_b-1,1} & \cdots & P_{m_b-1,n_b-1}
\end{bmatrix}
$$

Figure 5.1 The parity-check matrix H of the block-LDPC code

In our design, we consider an LDPC code with code rate 1/2 and a 288-by-576 parity-check matrix for the 802.16e standard. Considering circuit complexity, the 288-by-576 parity-check matrix H is divided into four 144-by-288 sub-matrices to fit the partial-parallel architecture, as shown in Figure 5.2. The LDPC code decoder architecture in our design is illustrated in Figure 5.4. This architecture contains 144 CNUs, 288 BNUs, and two dedicated message memory units (MMUs). The sets of data processed by the CNUs are {h00, h01} and {h10, h11}, whereas the data fed into the BNUs are {h00, h10} and {h01, h11}. Note that the two MMUs are employed to process two different codewords concurrently without stalls. Therefore, the LDPC decoder is not only area-efficient but also achieves a decoding speed comparable with fully parallel architectures.
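As an illustrative sketch of this partitioning and scheduling (Python for exposition only; only the sub-matrix names h00, h01, h10, h11 come from Figure 5.2, and the random placeholder matrix stands in for the actual H):

```python
import numpy as np

# Placeholder 288x576 parity-check matrix (the real H is given in Appendix A).
H = np.random.randint(0, 2, size=(288, 576))

# Split H into the four 144x288 sub-matrices of Figure 5.2.
h00, h01 = H[:144, :288], H[:144, 288:]
h10, h11 = H[144:, :288], H[144:, 288:]

# CNUs process one block row at a time (horizontal step).
cnu_schedule = [(h00, h01), (h10, h11)]
# BNUs process one block column at a time (vertical step).
bnu_schedule = [(h00, h10), (h01, h11)]
```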

Figure 5.2 The partition of parity-check matrix H

Figure 5.3 I/O pin of the decoder IP

Figure 5.4 The whole LDPC decoder architecture for the block LDPC code

The I/O pins of the decoder chip are shown in Figure 5.3. Figure 5.4 shows the block diagram of the decoder architecture; its modules are described in detail in the following. We adopt a partial-parallel architecture [19], so the decoder can handle two codewords at a time.

Input Buffer [19]

The input buffer is a storage component that receives and keeps channel values for iterative decoding. Channel values should be fed into the COPY module during initialization and BNU processing time.

COPY, INDEX, and ROM modules

The parity-check matrix H is sparse, which means there are few ones in the matrix, so it is not worthwhile to store the whole parity-check matrix in memory. Instead, we use the module INDEX to keep the information of H. We take a simple example to explain how these modules work; Figure 5.5 shows the simple parity-check matrix.

Figure 5.5 A simple parity-check matrix example, based on shifted identity matrix.

The parity-check matrix is composed of four sub-matrices, each of which is a right-circularly-shifted identity matrix. The shift amounts are given in Figure 5.5. Since the parity-check matrix in this example is 8-by-8, we receive 8 channel values.

The channel values are assumed to be the vector v = [v1 v2 v3 v4 v5 v6 v7 v8], and they are fed to the module "COPY". Figures 5.6(a) and 5.6(b) show how the modules "COPY", "INDEX", and "ROM" work. The outputs of the module "INDEX" are the four indexed vectors iv1, iv2, iv3, and iv4; they preserve the channel values and prepend the indices of the shift amounts. The indices of the shift amounts are stored in the module "ROM".

Figure 5.6 (a) The sub-modules of the whole decoder

Figure 5.6 (b) The outputs of the module INDEX

The indices represent the shift amounts, i.e., the information of H, so we place the indices in front of the channel values.
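A minimal sketch of this indexing step, under the assumption that the shift amounts of the example are kept in a small look-up table; the concrete shift values and names below are placeholders, not the ones in Figure 5.5:

```python
# Placeholder shift amounts for the four 4x4 circulant sub-matrices of the
# 8x8 example H (the real values are those given in Figure 5.5).
ROM = {"h00": 1, "h01": 3, "h10": 0, "h11": 2}

# Channel values v1..v8; the left block column uses v1..v4, the right v5..v8.
v = [0.4, -1.2, 0.7, 2.1, -0.3, 1.5, -2.2, 0.9]
block_column = {"h00": v[:4], "h10": v[:4], "h01": v[4:], "h11": v[4:]}

# INDEX copies each sub-vector and prepends the shift index from ROM,
# so the SHUFFLE modules later know how far to rotate the values.
indexed = {name: [ROM[name]] + block_column[name] for name in ROM}
print(indexed["h01"])   # -> [3, -0.3, 1.5, -2.2, 0.9]
```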

SHUFFLE1, SHUFFLE2 modules

Before sending the values to the check-node update unit, we have to shuffle the values to the left so that they are in the correct positions for the check-node computation, and shuffle them back to the right before the bit-node computation. The amount of shuffling is determined by the index numbers. Figures 5.7(a) and 5.7(b) show how the modules SHUFFLE1 and SHUFFLE2 work. In this example, (v2, v7), (v3, v8), (v4, v5), and (v1, v6) are the input pairs of the check-node update unit.

Before sending the values to the bit-node update unit, we have to shuffle the values back so that the results end up in the correct positions.
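A minimal sketch of the two shuffle operations, assuming SHUFFLE1 performs a circular left rotation by the stored index and SHUFFLE2 rotates back by the same amount (the exact rotation direction is our assumption; the thesis only states that SHUFFLE2 restores the original order):

```python
def shuffle_left(values, shift):
    """Circularly rotate the value vector left by 'shift' positions (SHUFFLE1)."""
    shift %= len(values)
    return values[shift:] + values[:shift]

def shuffle_right(values, shift):
    """Rotate back by the same amount before the bit-node update (SHUFFLE2)."""
    return shuffle_left(values, -shift)

v = ["v1", "v2", "v3", "v4"]
aligned = shuffle_left(v, 2)            # -> ['v3', 'v4', 'v1', 'v2']
assert shuffle_right(aligned, 2) == v   # SHUFFLE2 restores the original order
```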

Figure 5.7(a) Values shuffling before sending to check-node update unit

Figure 5.7(b) Values shuffling before sending to bit-node update unit

CNU [15]

Check node update units (CNUs) are used to compute the check node equation.

The check-to-bit message $r_{m,l}$ for the check node m and bit node l is computed by the CNU from the incoming bit-to-check messages $q_{m,l'}$ as follows:

$$
r_{m,l} = \left( \prod_{l' \in L(m)\setminus l} \operatorname{sign}\!\left(q_{m,l'}\right) \right) \times \min_{l' \in L(m)\setminus l} \left| q_{m,l'} \right| \qquad (5.1)
$$

where $L(m)\setminus l$ denotes the set of bit nodes connected to the check node m except l. Figure 5.8(a) shows the architecture of the CNU using the min-sum algorithm. The check node update unit has 6 inputs and 6 outputs. In Figures 5.8(a) and 5.8(b), the output of "MIN" is the minimum of its two inputs. The aim of this circuit is, for each output, to find the minimum magnitude among the other five inputs. This architecture is quite straightforward.
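A short software model of equation (5.1) for a single check node may help; this is an illustrative sketch of the min-sum update, not the hardware description, and the function and variable names are ours:

```python
def cnu_min_sum(q):
    """Min-sum check-node update: for each edge l, return the product of the
    signs and the minimum magnitude over all other incoming messages q[l']."""
    r = []
    for l in range(len(q)):
        others = [q[k] for k in range(len(q)) if k != l]
        sign = 1
        for m in others:
            sign *= -1 if m < 0 else 1
        r.append(sign * min(abs(m) for m in others))
    return r

# A 6-input CNU as in Figure 5.8(a):
print(cnu_min_sum([0.5, -1.25, 2.0, -0.75, 3.0, 1.5]))
# -> [0.75, -0.5, 0.5, -0.5, 0.5, 0.5]
```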

Figure 5.8(b) shows the architecture of the CNU using the proposed modified min-sum algorithm.

Figure 5.8(a) The architecture of CNU using min-sum algorithm

Figure 5.8(b) The architecture of CNU using modified min-sum algorithm

The other way to implement equation (5.1) is to search for the minimum and the second minimum among the inputs. Figure 5.9 shows the block diagram of the compare-select unit (CS6). The detailed architecture of CMP-6 in Figure 5.9 is illustrated in Figure 5.10; it consists of two kinds of comparators, CMP-2 and CMP-4. CMP-4 finds the minimum and the second minimum of its four inputs a, b, c, and d. In addition, CMP-2 is a two-input comparator, which is much simpler than CMP-4.
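The min1/min2 idea behind this compare-select structure can be sketched in a few lines (an illustrative software model of what the CS6 tree computes, not of the CMP-2/CMP-4 wiring):

```python
def min1_min2(values):
    """Return (minimum, second minimum) of the input magnitudes in one pass."""
    m1, m2 = float("inf"), float("inf")
    for v in values:
        if v < m1:
            m1, m2 = v, m1
        elif v < m2:
            m2 = v
    return m1, m2

# Each CNU output then equals m1, except on the edge that supplied m1,
# where the second minimum m2 is used instead.
mags = [0.5, 1.25, 2.0, 0.75, 3.0, 1.5]
print(min1_min2(mags))   # -> (0.5, 0.75)
```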

Figure 5.9 Block diagram of CS6 module

Figure 5.10(a) Block diagram of CMP-4 module

Figure 5.10(b) Block diagram of CMP-6 module

The whole architecture of the 6-input CNU is shown in Figure 5.11.

Figure 5.11 CNU architecture using min-sum algorithm

Table 5.1 compares the hardware performance of the two different CNU architectures. We call the architecture in Figure 5.8(a) the direct CNU architecture and the architecture in Figure 5.11 the backhanded CNU architecture. The direct CNU architecture requires only 45% of the area of the backhanded CNU architecture, so we choose the direct CNU architecture.

Table 5.1 Comparison of direct and backhanded CNU architectures

                           Direct CNU architecture   Backhanded CNU architecture
Area (gate count)          0.52k                     1.16k
Speed (MHz)                100                       100
Power consumption (mW)     4.82                      10.85

BNU

Figure 5.12 shows the architecture of the bit node update unit with 4 inputs. "SM" means the sign-magnitude representation and "2's" means the two's complement representation. When finding the minimum absolute value of two inputs, the sign-magnitude representation is more suitable for hardware implementation than two's complement; in contrast, for addition, the two's complement representation is more suitable than sign-magnitude.
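The two number formats can be illustrated with a small conversion sketch (assuming a 4-bit word for readability; the actual BNU word length follows the chosen quantization scheme):

```python
def sm_to_twos(sign, mag, bits=4):
    """Convert a sign-magnitude value (sign bit + magnitude) to a two's
    complement integer of the given word length."""
    value = -mag if sign else mag
    return value & ((1 << bits) - 1)     # wrap into the bits-wide field

def twos_to_sm(word, bits=4):
    """Convert a two's complement word back to (sign, magnitude)."""
    value = word - (1 << bits) if word & (1 << (bits - 1)) else word
    return (1 if value < 0 else 0), abs(value)

# -3 in 4-bit sign-magnitude is (1, 3); in two's complement it is 0b1101.
assert sm_to_twos(1, 3) == 0b1101
assert twos_to_sm(0b1101) == (1, 3)
```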

Figure 5.12 The architecture of the bit node updating unit with 4 inputs

MMU0 and MMU1 [19]

Reference [19] introduces a partial-parallel decoder architecture that increases the decoder throughput with moderate decoder area. We adopt the partial-parallel architecture in our design and improve the message memory units.

Message memory units (MMUs) are used to store the message values generated by the CNUs and BNUs. To increase the decoding throughput, two MMUs are employed to process two different codewords concurrently in the decoder. The register exchange scheme based on four sub-blocks (RE-4B) is proposed, as shown in Figure 5.13(a). In the MMU, sub-blocks A, B, and D capture the outputs from the CNUs while sub-blocks C and D deliver the message data to SHUFFLE2. The detailed timing diagram of MMU0 and MMU1 is illustrated in Figure 5.13(b), where hxy(0) denotes the copied message of codeword 0 and hxy(1) that of codeword 1.

Figure 5.13(a) The architecture of RE-4B based MMU

Figure 5.13(b) The timing diagram of the message memory units

During the iterative decoding procedure, MMU0 and MMU1 pass messages to each other through the SHUFFLE1, CNU, SHUFFLE2, and BNU modules. Disregarding the combinational circuits, the detailed relationship and message-passing snapshots between MMU0 and MMU1 are shown in Figure 5.14.

Figure 5.14 The message passing snapshots between MMU0 and MMU1

5.2 Hardware Performance Comparison and Summary

To compare the area, speed, latency, and power consumption of the architectures discussed in this chapter, we describe the hardware architectures in VHDL, and then simulate and synthesize them using Synopsys EDA tools, including PrimePower and Design Analyzer. The process technology is the UMC 0.18 µm process. Table 5.2 lists the results of the CNU using the min-sum algorithm and the proposed modified min-sum algorithm.

Table 5.2 Area, speed, and power consumption of the CNU using min-sum algorithm and modified min-sum algorithm

                           6-input CNU   6-input CNU (modified)   7-input CNU   7-input CNU (modified)
Area (gate count)          0.52k         0.57k                    0.72k         0.79k
Speed (MHz)                100           100                      100           100
Power consumption (mW)     4.82          4.96                     6.77          7.1

As mentioned before, two different codewords are processed concurrently without any stalls; in our proposed design, the BNUs and CNUs have no idle time, which leads to an efficient utilization of the functional units. The design takes four cycles to complete a decoding iteration for each codeword, including two cycles for the horizontal steps in the CNUs and two cycles for the vertical steps in the BNUs. For channel value loading, each codeword takes two extra cycles. Since the maximum number of decoding iterations is 10, the total number of cycles needed to complete the decoding of the two codewords is 2+2+10*4 = 44 cycles. According to our initial synthesis results, the clock frequency is 100 MHz, and thus the data decoding throughput is 100 MHz * [1152 * (1/2)] / 44 ≈ 1.31 Gbps.
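The cycle count and throughput quoted above can be reproduced with the following back-of-the-envelope calculation; it is a sketch of the arithmetic in the text, not a timing model of the decoder:

```python
clock_mhz    = 100        # synthesized clock frequency
codewords    = 2          # two codewords decoded concurrently
code_length  = 576        # coded bits per codeword
code_rate    = 0.5
iterations   = 10
load_cycles  = 2 * codewords                 # channel-value loading, 2 per codeword
iter_cycles  = 4 * iterations                # 2 CNU + 2 BNU cycles per iteration
total_cycles = load_cycles + iter_cycles     # 4 + 40 = 44 cycles

info_bits  = codewords * code_length * code_rate     # 576 information bits
throughput = clock_mhz * 1e6 * info_bits / total_cycles
print(f"{total_cycles} cycles, {throughput / 1e9:.2f} Gbps")   # 44 cycles, 1.31 Gbps
```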

The proposed LDPC decoder is compared with other designs in Table 5.3. The objective of our design is to devise a high-throughput LDPC decoder with a small chip area, and the partial-parallel decoder architecture meets this demand. Compared with [19], our design has a lower data throughput because our decoder has a shorter code length and a lower code rate: in our design one codeword has 288 message bits, whereas in [19] one codeword has 720 message bits. Moreover, considering the BER performance, we choose a maximum iteration number of 10, which also reduces the data throughput. The advantage of our design is its chip area: although we use more quantization bits, the chip area of our design is only 82.6% of that of the design in [19] and 54.3% of that of the design in [17].

Table 5.3 Comparison of LDPC decoders

                            Proposed LDPC decoder   [19]               [17]
Code length                 576                     1200               1024
Code rate                   1/2                     3/5                1/2
Quantization bits           7                       6                  4
Iteration number            10                      8                  10
Architecture                Partial-parallel        Partial-parallel   Fully-parallel
Process technology (µm)     0.18                    0.18               0.16
Clock rate (MHz)            100                     83                 64
Power (mW)                  620                     644                690
Area (gate count)           950k                    1150k              1750k
Throughput (Mbps)           1310                    3330               500

Chapter 6

Conclusions and Future Work

6.1 Conclusions

From this work, we conclude that using the dynamic normalized-offset technique in the LDPC decoder can further improve the error-correction performance compared with the conventional method. Various simulation results of the LDPC decoder have been investigated, and the optimal choice considering the tradeoff between hardware complexity and performance has been discussed in this thesis.

In this thesis, high-throughput and area-efficient LDPC code decoders based on a partial-parallel architecture are proposed for high-speed communication systems. A decoder for the (576, 288) LDPC code in the 802.16e standard has been implemented, of which the code rate is 1/2, the code length is 576 bits, and the maximum number of decoding iterations is 10. The LDPC decoder in our design achieves a data throughput of 1.31 Gbps with a chip area of 950k gates using the UMC 0.18 µm process technology.

6.2 Future Work

The normalization factor β and the offset factor α influence the decoder BER performance considerably. Through our research, we found that our proposed dynamic normalized-offset technique and the dynamic normalization technique [23] have similar BER decoding performance. Another idea is to dynamically adjust the two factors α and β at the same time; the threshold values of α and β may be obtained through simulations. Moreover, as mentioned in Appendix A, there are many different codeword lengths and code rates in the 802.16e standard. Our future work is to integrate them into a multi-mode 802.16e LDPC decoder design.

Appendix A

LDPC Codes Specification in IEEE 802.16e

OFDMA

The LDPC code in IEEE 802.16e is a systematic linear block code, where k systematic information bits are encoded to n coded bits by adding m = n − k parity-check bits. The code rate is k/n.

The LDPC code in IEEE 802.16e is defined by a parity-check matrix H of size m × n that is expanded from a binary base matrix H_b of size m_b × n_b, where m = z · m_b and n = z · n_b. In this standard, there are six different base matrices. The one for the rate 1/2 code is depicted in Figure A.1. There are two different base matrices for the two rate 2/3 codes: type A is shown in Figure A.2 and type B in Figure A.3. There are also two different base matrices for the two rate 3/4 codes: type A is shown in Figure A.4 and type B in Figure A.5. The one for the rate 5/6 code is depicted in Figure A.6. In these base matrices, the size n_b is an integer equal to 24 and the expansion factor z is an integer between 24 and 96. Therefore, the minimum code length is n_min = 24 × 24 = 576 bits and the maximum code length is n_max = 24 × 96 = 2304 bits.

For the rate 1/2, 2/3B, 3/4A, 3/4B, and 5/6 codes, the shift sizes p(f, i, j) for a code size corresponding to the expansion factor z_f are derived from p(i, j), the element at the i-th row and j-th column of the base matrix, by scaling p(i, j) proportionally as

$$
p(f,i,j) =
\begin{cases}
p(i,j), & p(i,j) \le 0 \\
\left\lfloor \dfrac{p(i,j)\, z_f}{z_0} \right\rfloor, & p(i,j) > 0
\end{cases}
$$

where z_0 = 96 is the maximum expansion factor. A negative entry p(i, j) corresponds to a z × z zero matrix, and a non-negative shift size p(f, i, j) corresponds to a z × z permutation matrix. The permutation matrix represents a circular right shift by p(f, i, j).
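As an illustration of this expansion rule, the following sketch scales a base-matrix entry and expands it into a z × z sub-matrix; it assumes z_0 = 96 as stated above, and the example entry p(i, j) = 40 is a placeholder rather than a value taken from Figure A.1:

```python
import numpy as np

Z0 = 96   # maximum expansion factor in the standard

def scaled_shift(p, zf):
    """Scale a base-matrix entry p(i, j) to the shift size p(f, i, j)
    for expansion factor zf (rate 1/2, 2/3B, 3/4A, 3/4B, 5/6 codes)."""
    return p if p <= 0 else (p * zf) // Z0

def expand_entry(p, zf):
    """Expand one base-matrix entry into a zf x zf sub-matrix:
    negative -> zero matrix, otherwise identity circularly right-shifted."""
    if p < 0:
        return np.zeros((zf, zf), dtype=int)
    return np.roll(np.eye(zf, dtype=int), scaled_shift(p, zf), axis=1)

# Placeholder entry p(i, j) = 40 expanded with z = 24: shift = 40*24//96 = 10.
sub = expand_entry(40, 24)
print(sub[0].argmax())   # -> 10, i.e. a circular right shift by 10
```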

Figure A.1 Base matrix of the rate 1/2 code

Rate 2/3 A code:

Figure A.2 Base matrix of the rate 2/3, type A code

Rate 2/3 B code:

Figure A.3 Base matrix of the rate 2/3, type B code

Rate 3/4 A code:

Figure A.4 Base matrix of the rate 3/4, type A code

Rate 3/4 B code:

Figure A.5 Base matrix of the rate 3/4, type B code

Rate 5/6 code:

Figure A.6 Base matrix of the rate 5/6 code

References

[1] R. G. Gallager, Low-Density Parity-Check Codes, Cambridge, MA: MIT Press, 1963.

[2] D. J. C. MacKay and R. M. Neal, "Near Shannon limit performance of low density parity check codes," Electron. Lett., Vol. 32, pp. 1645-1646, Aug. 1996.

[3] T. J. Richardson and R. L. Urbanke, "Efficient encoding of low-density parity-check codes," IEEE Trans. Inform. Theory, Vol. 47, pp. 638-656, Feb. 2001.

[4] D. J. C. MacKay, S. T. Wilson, and M. C. Davey, "Comparison of constructions of irregular Gallager codes," IEEE Trans. Comm., Vol. 47, pp. 1449-1454, Oct. 1999.

[5] S. J. Johnson and S. R. Weller, "A family of irregular LDPC codes with low encoding complexity," IEEE Comm. Lett., Vol. 7, pp. 79-81, Feb. 2003.

[6] J. Chen, A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, and X. Y. Hu, "Reduced-complexity decoding of LDPC codes," IEEE Trans. Commun., Vol. 53, pp. 1288-1299, July 2005.

[7] R. Tanner, "A recursive approach to low complexity codes," IEEE Trans. Inform. Theory, Vol. 27, pp. 533-547, Sep. 1981.

[8] M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman, and V. Stemann, "Practical loss-resilient codes," IEEE Trans. Inform. Theory, Vol. 47, pp. 569-584, Feb. 2001.

[9] T. J. Richardson, M. A. Shokrollahi, and R. L. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," IEEE Trans. Inform. Theory, Vol. 47, pp. 619-637, Feb. 2001.

[10] D. J. C. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Trans. Inform. Theory, Vol. 45, pp. 399-431, Mar. 1999.

[11] F. R. Kschischang, B. J. Frey, and H. A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inform. Theory, Vol. 47, pp. 498-519, Feb. 2001.

[12] H. Futaki and T. Ohtsuki, "Low-density parity-check (LDPC) coded OFDM systems," IEEE VTS, Vol. 1, pp. 82-86, Fall 2001.

[13] X. Y. Hu, E. Eleftheriou, D. M. Arnold, and A. Dholakia, "Efficient implementation of the sum-product algorithm for decoding LDPC codes," IEEE GLOBECOM'01, Vol. 2, pp. 1036-1036E, Nov. 2001.

[14] J. Chen and M. P. C. Fossorier, "Near optimum universal belief propagation based decoding of low-density parity check codes," IEEE Trans. Commun., Vol. 50, No. 3, pp. 583-587, Mar. 2002.

[15] M. Karkooti and J. R. Cavallaro, "Semi-parallel reconfigurable architectures for real-time LDPC decoding," IEEE ITCC'04, Vol. 65, pp. 683-689.

[16] Z. Wang, Y. Chen, and K. K. Parhi, "Area efficient decoding of quasi-cyclic low density parity check codes," IEEE ICASSP'04, Vol. 5, pp. 49-52, May 2004.

[17] A. J. Blanksby and C. J. Howland, "A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder," IEEE J. Solid-State Circuits, Vol. 37, pp. 404-412, Mar. 2002.

[18] Y. Chen and D. Hocevar, "A FPGA and ASIC implementation of rate 1/2, 8088-b irregular low density parity check decoder," IEEE GLOBECOM'03, Vol. 3, pp. 113-117, Dec. 2003.

[19] C.-C. Lin, K.-L. Lin, H.-C. Chang, and C.-Y. Lee, "A 3.33Gb/s (1200,720) low-density parity check code decoder," IEEE Proceedings of ESSCIRC, Grenoble, France, 2005.

[20] T. M. N. Ngatched, M. Bossert, and A. Fahrner, "Two decoding algorithms for low-density parity check codes," IEEE ITCC'04, Vol. 32, pp. 253-257.

[21] Y.-J. Chu and S.-G. Chen, "An efficient LDPC code structure combined with the concept of difference family," IWCMC'06, Vol. 18, pp. 355-360.

[22] I. V. Kozintsev, Software for low-density parity-check codes. [Online] Available: http://www.kozintsev.net/soft.html.

[23] Y.-C. Liao, C.-C. Lin, C.-W. Liu, and H.-C. Chang, "A dynamic normalization technique for decoding LDPC codes," IEEE Signal Processing Systems Design and Implementation, pp. 768-772, Nov. 2005.

Autobiography

邱敏杰 was born on June 15, 1982, in Kaohsiung County. He graduated from the Department of Electrical Engineering, National Chi Nan University, in 2004, and then entered the Institute of Electronics, National Chiao Tung University, to pursue a master's degree. His research interests are communication systems and digital signal processing. The title of his master's thesis is "Improvement of Low-Density Parity-Check Code Decoding Algorithms and the Design of Their High-Speed Decoder Architectures."