
Chapter 3 Modified Min-Sum Algorithms


Figure 3.2 The absolute difference between the normalization technique and sum-product algorithm, vs. the normalization factor β

3.2 Dynamic Normalization Technique for Min-Sum Algorithm [23]

In Section 3.1, the normalization factor β is used to compensate the result of equation (3.2) so that it approximates equation (3.1) more accurately. Reference [23] presents the idea of adjusting the normalization factor β dynamically to obtain better decoding performance. The normalization factor β then takes the form:

$$\beta=\begin{cases}\beta_1, & \text{when } B<K\\ \beta_2, & \text{when } B\ge K\end{cases}\qquad(3.15)$$

In [23], two normalization factors β1 and β2 are selected first. For convenience of hardware implementation, only certain simple values of β1 and β2 should be chosen for finite-precision realizations. For a check node degree of 6, β1 = 0.75 and β2 = 0.875 are found to be good choices. Then, through simulations, one can find the optimum threshold value K that yields the lowest decoder BER. The detailed simulation results are presented in Chapter 4.
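As a minimal sketch (our own illustration, not code from [23]), the selection of β and the resulting check-node value can be written as follows; we assume here that B denotes the minimum input magnitude at the check node, and the function names are ours.

```python
import math

def dynamic_beta(B, K, beta1=0.75, beta2=0.875):
    """Select the normalization factor according to equation (3.15)."""
    return beta1 if B < K else beta2

def check_node_normalized(msgs, K):
    """Check-node value under dynamic normalization: beta * min|L_i|.
    Assumption (ours): B in (3.15) is the minimum input magnitude."""
    B = min(abs(m) for m in msgs)
    sign = math.prod(1 if m >= 0 else -1 for m in msgs)
    return sign * dynamic_beta(B, K) * B
```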

3.3 Proposed Dynamic Normalized-Offset-Compensation Technique for Min-Sum Algorithm

Compared to the dynamic normalization technique, one can extend the idea by adding an additional offset factor α to equation (3.2) [6] in order to obtain even more accurate check-node update values. Equation (3.16) shows the normalized-offset technique for the min-sum algorithm.

$$\operatorname{CHK}(L_1\oplus L_2\oplus\cdots\oplus L_w)=\prod_{i=1}^{w}\operatorname{sign}(L_i)\cdot\Big\{\beta\cdot\min\big[\,|L_1|,|L_2|,\ldots,|L_w|\,\big]+\alpha\Big\}\qquad(3.16)$$

In Section 3.1, we have chosen the value β = 0.75 for a check node degree of 6. Through the simulations in Chapter 4, we find that for a fixed value of α the decoding performance is not always better than that of α = 0. This motivates adjusting the offset factor α dynamically to obtain better performance. Equation (3.17) shows the dynamic offset factor α:

$$\alpha=\begin{cases}\alpha_1, & \text{when } B<K\\ \alpha_2, & \text{when } B\ge K\end{cases}\qquad(3.17)$$

Through simulations, we can decide the best values of α1 and α2. As discussed in Section 3.1, in hardware implementation only certain simple values of α1 and α2 will be chosen for finite-precision realizations. For a check node degree of 6, we found that α1 = 0 and α2 = 0.125 are good choices.

In the following, we decide the threshold K for a particular LDPC code. Figure 3.3 shows the selection of K for the rate-1/2 LDPC code at various SNRs. K = 0 means that the offset factor α is fixed; otherwise, the offset factor α is dynamic. From the figure, we find that the threshold value K = 1.5 is a good choice. The detailed simulation results will be shown in Chapter 4.

Figure 3.3 BER performance vs. threshold values K for rate 1/2 LDPC code
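The combined update of equations (3.16) and (3.17) can be sketched the same way; as before, treating B as the minimum input magnitude is our reading, and the constants follow the values chosen above (β = 0.75, α1 = 0, α2 = 0.125, K = 1.5).

```python
import math

def check_node_dnoms(msgs, K=1.5, beta=0.75, alpha1=0.0, alpha2=0.125):
    """Dynamic normalized-offset min-sum check-node update, following
    equations (3.16)-(3.17)."""
    B = min(abs(m) for m in msgs)
    alpha = alpha1 if B < K else alpha2        # equation (3.17)
    sign = math.prod(1 if m >= 0 else -1 for m in msgs)
    return sign * (beta * B + alpha)           # equation (3.16)
```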

Chapter 4

Simulation Results and Analysis

In the beginning of this chapter, we make a comparison of error-correction performance using different structures of parity-check matrices, such as a randomly constructed code and the block-LDPC code in the 802.16e standard. Then we compare the error-correction performance of the major decoding algorithms for LDPC codes: the sum-product algorithm, the min-sum algorithm, and the proposed improved min-sum algorithm. In the end, we further analyze the finite-precision effects on the decoding performance and decide proper word lengths of variables, considering tradeoffs between performance and hardware cost.

Before proceeding to the following simulations, some parameters should be described here:

1: The randomly constructed codes are derived from [22], and they have regular column and row weights.

2: The block-LDPC code used is for the 802.16e standard.

3: For the decoding algorithm, we adopt the sum-product algorithm, the min-sum algorithm, and the proposed modified min-sum algorithm.

4: We assume AWGN channels and BPSK modulation as our test environment conditions; a sketch of this setup follows the list.
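As an illustration of condition 4, a minimal Python sketch of the BPSK-over-AWGN test environment and the resulting channel LLRs is given below; the helper name and interface are our own.

```python
import numpy as np

def bpsk_awgn_llr(bits, ebno_db, code_rate=0.5, seed=None):
    """Map coded bits (numpy array of 0/1) to BPSK, pass through AWGN at
    the given Eb/No (dB), and return the channel LLRs for the decoder."""
    rng = np.random.default_rng(seed)
    ebno = 10.0 ** (ebno_db / 10.0)
    sigma2 = 1.0 / (2.0 * code_rate * ebno)    # noise variance per dimension
    x = 1.0 - 2.0 * bits                       # bit 0 -> +1, bit 1 -> -1
    y = x + rng.normal(0.0, np.sqrt(sigma2), size=bits.shape)
    return 2.0 * y / sigma2                    # LLR of each received sample
```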

4.1 Floating-Point Simulations

One of the most important factors when decoding the received signals is the iteration number. As the number becomes larger, the correct codewords are more likely to be decoded. However, more iterations imply higher computation cost and latency. Therefore, we need to choose a proper iteration number for the decoding process. Figure 4.1 shows the BER simulation results vs. SNR for different iteration numbers, for the rate-1/2, length-576 LDPC code with BPSK and the sum-product decoding algorithm. We find that the performance improvement beyond 10 iterations tends to be insignificant, about 0.2 dB. As a result, LDPC decoding with 10 iterations is considered a good choice for practical implementation.

Figure 4.1 Decoding performance (BER vs. Eb/No) at different iteration numbers (1, 10, 20, 30, 50).

Figure 4.2 BER performance (BER vs. Eb/No) of the rate-1/2 code at codeword lengths 576 and 2304, in AWGN channel, maximum iteration = 10.

Figure 4.3 Floating-point BER simulations (BER vs. Eb/No) of the min-sum and sum-product decoding algorithms in AWGN channel, code length = 576, code rate = 1/2, maximum iteration = 10.

Figure 4.4 Floating-point BER simulations (BER vs. Eb/No) of the normalized min-sum decoding algorithm (β = 0.5, 0.75, 0.875) and the sum-product algorithm in AWGN channel, code length = 576, code rate = 1/2, maximum iteration = 10.

Figure 4.5 Floating-point BER simulations (BER vs. Eb/No) of the normalized-offset min-sum decoding algorithm (α = 0.25, −0.25, 0, 0.125) in AWGN channel, code length = 576, code rate = 1/2, maximum iteration = 10.

Figure 4.6 Floating-point BER simulations (BER vs. Eb/No) of the proposed dynamic normalized-offset min-sum (DNOMS) decoding algorithm compared with the normalized min-sum (β = 0.75) and sum-product algorithms, in AWGN channel, code length = 576, code rate = 1/2.

Figure 4.7 Floating-point BER simulations (BER vs. Eb/No) of the dynamic normalization, dynamic normalization with offset factor, proposed DNOMS, and sum-product algorithms.

4.2 Fixed-Point Simulations

In this section, we further analyze the finite-word-length performance of the LDPC decoder. Possible tradeoffs between hardware complexity and decoding performance will be discussed. Let [t:f] denote the quantization scheme in which a total of t bits are used, f of them for the fractional part of the values.

Various quantization configurations, such as [6:3], [7:3], and [8:4], are investigated here; a sketch of the scheme is given below.
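A minimal sketch of the [t:f] quantization used in these simulations; the function and its rounding/saturation behavior are our assumptions about a typical fixed-point model, not a specification from the thesis.

```python
def quantize(x, t, f):
    """Quantize x to the [t:f] scheme: t total bits, f of them fractional,
    with saturation at the two's-complement range."""
    step = 2.0 ** (-f)
    lo = -(2 ** (t - 1)) * step                # most negative representable
    hi = (2 ** (t - 1) - 1) * step             # most positive representable
    q = round(x / step) * step                 # round to the nearest level
    return min(max(q, lo), hi)                 # saturate to the range

# Example: quantize(1.37, 7, 3) -> 1.375 under the [7:3] configuration.
```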

Figure 4.8 Fixed-point BER simulations (BER vs. Eb/No) of three quantization configurations ([6:3], [7:3], [8:4]) of the min-sum decoding algorithm, compared with the floating-point version, in AWGN channel, code length = 576, code rate = 1/2, maximum iteration = 10.

Figure 4.9 Floating-point vs. fixed-point [7:3] BER simulations (BER vs. Eb/No) of the normalized min-sum (β = 0.75) and dynamic normalized-offset min-sum algorithms.

Chapter 5

Architecture Designs of LDPC Code Decoders

In this chapter, we introduce the hardware architecture of our LDPC code decoder and discuss the implementation of an irregular LDPC decoder for the 802.16e standard. The decoder has a code rate of 1/2 and a code length of 576 bits. The parity-check matrix of this code is listed in Appendix A.

5.1 The Whole Decoder Architecture

The parity-check matrix H in our design is in block-LDPC form, as discussed in Section 2.2. The matrix is composed of m_b × n_b sub-matrices, each of which is either a zero matrix or a permutation matrix of the same size z × z. The permutations used are circular right shifts, so the set of permutation matrices contains the z × z identity matrix and its circularly right-shifted versions.

$$H=\begin{bmatrix}P_{0,0} & P_{0,1} & \cdots & P_{0,n_b-1}\\ P_{1,0} & P_{1,1} & \cdots & P_{1,n_b-1}\\ \vdots & \vdots & \ddots & \vdots\\ P_{m_b-1,0} & P_{m_b-1,1} & \cdots & P_{m_b-1,n_b-1}\end{bmatrix}$$

Figure 5.1 The parity-check matrix H of the block-LDPC code
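As an illustration, a sketch of expanding such a base matrix of shift indices into the full binary H, using the circular-right-shift convention above and the usual convention that a negative entry marks a zero sub-matrix (the helper is our own):

```python
import numpy as np

def expand_base_matrix(base, z):
    """Expand a base matrix of shift indices into the binary matrix H.
    Convention: entry -1 -> z-by-z zero matrix; entry s >= 0 -> identity
    matrix circularly right-shifted by s."""
    mb, nb = base.shape
    H = np.zeros((mb * z, nb * z), dtype=np.uint8)
    identity = np.eye(z, dtype=np.uint8)
    for i in range(mb):
        for j in range(nb):
            s = base[i, j]
            if s >= 0:
                H[i*z:(i+1)*z, j*z:(j+1)*z] = np.roll(identity, s, axis=1)
    return H
```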

In our design, we consider an LDPC code with code rate 1/2 and a 288-by-576 parity-check matrix for the 802.16e standard. Considering circuit complexity, the 288-by-576 parity-check matrix H is divided into four 144-by-288 sub-matrices to fit the partial-parallel architecture, as shown in Figure 5.2. The LDPC decoder architecture in our design is illustrated in Figure 5.4. This architecture contains 144 CNUs, 288 BNUs, and two dedicated message memory units (MMUs). The sets of data processed by the CNUs are {h00, h01} and {h10, h11}, whereas the data fed into the BNUs are {h00, h10} and {h01, h11}. Note that two MMUs are employed to process two different codewords concurrently without stalls. Therefore, the LDPC decoder is not only area-efficient but also has a decoding speed comparable with fully parallel architectures.

Figure 5.2 The partition of parity-check matrix H

Figure 5.3 I/O pin of the decoder IP

Figure 5.4 The whole LDPC decoder architecture for the block LDPC code

The I/O pins of the decoder chip are shown in Figure 5.3. Figure 5.4 shows the block diagram of the decoder architecture; its modules are described in detail in the following. We adopt a partial-parallel architecture [19], so the decoder can handle two codewords at a time.

Input Buffer [19]

The input buffer is a storage component that receives and keeps the channel values for iterative decoding. The channel values are fed into the COPY module during initialization and during BNU processing time.

COPY, INDEX, and ROM modules

The parity-check matrix H is sparse, which means there are few ones in the matrix, so it is not worthwhile to store the whole matrix in memory. Instead, we use the INDEX module to keep the information of H. We take a simple example to explain how these modules work; Figure 5.5 shows the example parity-check matrix.

Figure 5.5 A simple parity-check matrix example, based on shifted identity matrix.

The parity-check matrix is composed of 4 sub-matrices, each a right-circular-shifted identity matrix. The shift amounts are indicated in Figure 5.5. Since the parity-check matrix in this example is 8-by-8, we receive 8 channel values.

The channel values are assumed to be v = [v1 v2 v3 v4 v5 v6 v7 v8], and they are fed to the COPY module. Figures 5.6(a) and 5.6(b) show how the COPY, INDEX, and ROM modules work. The outputs of the INDEX module are the vectors iv1, iv2, iv3, and iv4; they preserve the channel values and prepend the indices of the shift amounts. The indices of the shift amounts are stored in the ROM module.

Figure 5.6 (a) The sub-modules of the whole decoder

Figure 5.6 (b) The outputs of the module INDEX

The indices represent the shift amounts and thus carry the information of H, so we place the indices in front of the channel values.

SHUFFLE1, SHUFFLE2 modules

Before sending the values to the check-node update unit, we shuffle the values left to put them in the correct positions for the check-node computation, and we shuffle the values right before the bit-node computation. The amount of shuffling is decided by the index numbers. Figures 5.7(a) and 5.7(b) show how the SHUFFLE1 and SHUFFLE2 modules work. In this example,

(v2, v7), (v3, v8), (v4, v5), (v1, v6) are the input pairs of the check-node update unit.

Before sending the values to the bit-node update unit, we shuffle the values back so that they return to their correct positions, as in the sketch below.
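Since every sub-matrix is a circularly shifted identity, the shuffles reduce to cyclic rotations by the stored index. A minimal sketch of the two operations on list-valued message blocks (our own formulation, not the RTL):

```python
def shuffle_left(values, shift):
    """Cyclically rotate a block of values left by 'shift' positions
    (SHUFFLE1, before the check-node update)."""
    shift %= len(values)
    return values[shift:] + values[:shift]

def shuffle_right(values, shift):
    """Rotate the block back right by 'shift' positions
    (SHUFFLE2, before the bit-node update)."""
    return shuffle_left(values, -shift)

# shuffle_right(shuffle_left(v, s), s) == v for any list v and shift s.
```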

Figure 5.7(a) Values shuffling before sending to check-node update unit

Figure 5.7(b) Values shuffling before sending to bit-node update unit

CNU [15]

Check node update units (CNUs) are used to compute the check node equation. The check-to-bit message r_{m,l} for check node m and bit node l, computed by the CNU from the incoming bit-to-check messages q_{m,l'}, is as follows:

$$r_{m,l}=\Bigg(\prod_{l'\in L(m)\setminus l}\operatorname{sign}\big(q_{m,l'}\big)\Bigg)\times\min_{l'\in L(m)\setminus l}\big|q_{m,l'}\big|\qquad(5.1)$$

where L(m)\l denotes the set of bit nodes connected to check node m, excluding l. Figure 5.8(a) shows the architecture of the CNU using the min-sum algorithm. The check node update unit has 6 inputs and 6 outputs. In Figures 5.8(a) and 5.8(b), the output of "MIN" is the minimum of its two inputs; the aim of the circuit is to find, for each output, the minimum among the other 5 inputs. This architecture is quite straightforward.

Figure 5.8(b) shows the architecture of the CNU using the proposed modified min-sum algorithm.

Figure 5.8(a) The architecture of CNU using min-sum algorithm

Figure 5.8(b) The architecture of CNU using modified min-sum algorithm

The other way to implement equation (5.1) is to search for the minimum value and the second minimum value among the inputs. Figure 5.9 shows the block diagram of the compare-select unit (CS6). The detailed architecture of CMP-6 in Figure 5.9 is illustrated in Figure 5.10; it consists of two kinds of comparators, CMP-2 and CMP-4. CMP-4 finds the minimum and the second minimum values among its four inputs a, b, c, and d. CMP-2 is a two-input comparator, which is much simpler than CMP-4.
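Behaviorally, this min/second-min approach computes all outputs of equation (5.1) at once: each output uses the overall minimum magnitude unless that input is itself the minimum, in which case the second minimum is used. A Python sketch (ours, not the VHDL):

```python
def cnu_min_sum(q):
    """Compute all check-to-bit messages of equation (5.1) from the
    bit-to-check messages q, via the minimum and second minimum."""
    mags = [abs(v) for v in q]
    signs = [1 if v >= 0 else -1 for v in q]
    total_sign = 1
    for s in signs:
        total_sign *= s
    i_min = mags.index(min(mags))                            # index of minimum
    min2 = min(m for i, m in enumerate(mags) if i != i_min)  # second minimum
    # Excluding input l: the sign product is total_sign * signs[l], and the
    # minimum magnitude is min2 only when l is itself the minimum.
    return [total_sign * signs[l] * (min2 if l == i_min else mags[i_min])
            for l in range(len(q))]
```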

Figure 5.9 Block diagram of CS6 module

Figure 5.10(a) Block diagram of CMP-4 module

Figure 5.10(b) Block diagram of CMP-6 module

The whole architecture of the 6-input CNU is shown in Figure 5.11.

Figure 5.11 CNU architecture using min-sum algorithm

Table 5.1 compares the hardware performance of the two CNU architectures. We call the architecture in Figure 5.8(a) the direct CNU architecture and the architecture in Figure 5.11 the backhanded CNU architecture. The direct CNU architecture has only about 45% of the area of the backhanded CNU architecture, so we choose the direct CNU architecture.

Table 5.1 Comparison of direct and backhanded CNU architectures

                          Direct CNU architecture   Backhanded CNU architecture
Area (gate count)         0.52k                     1.16k
Speed (MHz)               100                       100
Power consumption (mW)    4.82                      10.85

BNU

Figure 5.12 shows the architecture of the bit node update unit for 4 inputs. "SM" means the sign-magnitude representation and "2's" means the two's complement representation. For finding the minimum absolute value of two inputs, the sign-magnitude representation is more suitable for hardware implementation than two's complement. In contrast, for addition, the two's complement representation is more suitable than sign-magnitude.

Figure 5.12 The architecture of the bit node updating unit with 4 inputs
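The SM/2's blocks perform the number-format conversions described above; a bit-level sketch, where the 7-bit default word width and the helper names are our assumptions, not taken from the design:

```python
def sm_to_twos(sm, bits=7):
    """Sign-magnitude word (MSB = sign) -> signed integer value."""
    mag = sm & ((1 << (bits - 1)) - 1)          # low bits hold the magnitude
    return -mag if (sm >> (bits - 1)) & 1 else mag

def twos_to_sm(val, bits=7):
    """Signed integer value -> sign-magnitude word (MSB = sign)."""
    sign = 1 if val < 0 else 0
    return (sign << (bits - 1)) | (abs(val) & ((1 << (bits - 1)) - 1))
```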

MMU0 and MMU1 [19]

Reference [19] introduces a partial-parallel decoder architecture that increases the decoder throughput with moderate decoder area. We adopt this partial-parallel architecture in our design and improve the message memory units.

Message memory units (MMUs) are used to store the message values generated by the CNUs and BNUs. To increase the decoding throughput, two MMUs are employed to process two different codewords concurrently in the decoder. The register exchange scheme based on four sub-blocks (RE-4B) is proposed, as shown in Figure 5.13(a). In an MMU, sub-blocks A, B, and D capture the outputs from the CNUs while sub-blocks C and D deliver the message data to SHUFFLE2. The detailed timing diagrams of MMU0 and MMU1 are illustrated in Figure 5.13(b), where hxy(0) denotes the copied message of codeword 0 and hxy(1) that of codeword 1.

Figure 5.13(a) The architecture of RE-4B based MMU

Figure 5.13(b) The timing diagram of the message memory units

During the iterative decoding procedure, MMU0 and MMU1 pass messages to each other through the SHUFFLE1, CNU, SHUFFLE2, and BNU modules. Disregarding the combinational circuits, the detailed relationship between MMU0 and MMU1 is shown in the snapshots of Figure 5.14.

Figure 5.14 The message passing snapshots between MMU0 and MMU1

5.2 Hardware Performance Comparison and Summary

To compare the area, speed, latency, and power consumption of the architectures discussed in this chapter, we describe the hardware architectures in VHDL and then simulate and synthesize them using Synopsys EDA tools, including PrimePower and Design Analyzer. The process technology is the UMC 0.18 µm process. Table 5.2 lists the results for the CNU using the min-sum algorithm and the proposed modified min-sum algorithm.

Table 5.2 Area, speed, and power consumption of the CNU using the min-sum and modified min-sum algorithms

                          6-input CNU   6-input CNU (modified)   7-input CNU   7-input CNU (modified)
Area (gate count)         0.52k         0.57k                    0.72k         0.79k
Speed (MHz)               100           100                      100           100
Power consumption (mW)    4.82          4.96                     6.77          7.1

As mentioned before, two different codewords are processed concurrently without any stalls. In our proposed design, the BNUs and CNUs have no idle time, which leads to efficient utilization of the functional units. The design takes four cycles to complete a decoding iteration for each codeword: two cycles for the horizontal steps in the CNUs and two cycles for the vertical steps in the BNUs. For channel value loading, each codeword takes two extra cycles. Since the maximum number of decoding iterations is 10, the total number of cycles needed to decode two different codewords is 2 + 2 + 10 × 4 = 44. According to our initial synthesis results, the clock frequency is 100 MHz, so the data decoding throughput is 100 MHz × [1152 × (1/2)] / 44 ≈ 1.31 Gbps.
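These numbers can be checked directly:

```python
# Cycle count and information throughput for two concurrently decoded
# codewords, using the figures quoted in the text.
cycles = 2 + 2 + 10 * 4                   # loading + 10 iterations of 4 cycles
fclk = 100e6                              # 100 MHz clock
info_bits = 2 * 576 * (1 / 2)             # two codewords, code rate 1/2
print(cycles, fclk * info_bits / cycles)  # 44 cycles, ~1.31e9 bps
```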

The proposed LDPC decoder is compared with other designs in Table 5.3. The objective of our design is a high-throughput LDPC decoder with small chip area, and the partial-parallel decoder architecture meets this demand. Compared with [19], our design has lower data throughput because it has a shorter code length and a lower code rate: one codeword in our design carries 288 message bits, while one codeword in [19] carries 720 bits. Moreover, considering the BER performance, we choose 10 iterations, which also reduces the data throughput. The strength of our design is its chip area: although we use more quantization bits, our chip area is only 82.6% of that of the design in [19] and 54.3% of that in [17].

Table 5.3 Comparison of LDPC decoders

                           Proposed LDPC decoder   [19]               [17]
Code length                576                     1200               1024
Code rate                  1/2                     3/5                1/2
Quantization bits          7                       6                  4
Iteration number           10                      8                  10
Architecture               Partial-parallel        Partial-parallel   Fully-parallel
Process technology (µm)    0.18                    0.18               0.16
Clock rate (MHz)           100                     83                 64
Power (mW)                 620                     644                690
Area (gate count)          950k                    1150k              1750k
Throughput (Mbps)          1310                    3330               500

Chapter 6

Conclusions and Future Work

6.1 Conclusions

From this work, we conclude that using the dynamic normalized-offset technique in an LDPC decoder can further improve the error-correction performance compared with the conventional method. Various simulation results of the LDPC decoder have been investigated, and the optimal choices considering the tradeoff between hardware complexity and performance have been discussed in this thesis.

In this thesis, with a partial-parallel architecture, a high-throughput and area-efficient LDPC code decoder is proposed for high-speed communication systems. A (576, 288) LDPC code in the 802.16e standard has been implemented, with code rate 1/2, code length 576 bits, and a maximum of 10 decoding iterations. The LDPC decoder in our design achieves a data throughput of 1.31 Gbps with a chip area of 950k gates using the UMC 0.18 µm process technology.

6.2 Future Work

The normalization factor β and the offset factor α influence the decoder BER performance considerably. Through our research, we found that our proposed dynamic normalized-offset technique and the dynamic normalization technique [23] have similar BER decoding performance. Another idea is to dynamically adjust the two factors α and β at the same time; the threshold values for α and β may be obtained through simulations. Moreover, as mentioned in Appendix A, there are many different codeword lengths and code rates in the 802.16e standard. Our future work is to integrate a multi-mode 802.16e LDPC decoder design.

Appendix A

LDPC Codes Specification in IEEE 802.16e

OFDMA

The LDPC code in IEEE 802.16e is a systematic linear block code, where k systematic information bits are encoded into n coded bits by adding m = n − k parity-check bits. The code rate is k/n.

The LDPC code in IEEE 802.16e is defined by a parity-check matrix H of size m × n that is expanded from a binary base matrix H_b of size m_b × n_b, where m = z · m_b and n = z · n_b. In this standard, there are six different base matrices. The one for the rate-1/2 code is depicted in Figure A.1. There are two for the rate-2/3 codes: type A in Figure A.2 and type B in Figure A.3. There are also two for the rate-3/4 codes: type A in Figure A.4 and type B in Figure A.5. The one for the rate-5/6 code is depicted in Figure A.6. In these base matrices, the size n_b is equal to 24 and the expansion factor z is an integer between 24 and 96. Therefore, the minimum code length is n_min = 24 × 24 = 576 bits and the maximum code length is n_max = 24 × 96 = 2304 bits.

For the rate-1/2, 2/3B, 3/4A, 3/4B, and 5/6 codes, the shift sizes p(f, i, j) for the code size corresponding to expansion factor z_f are derived from p(i, j), the element at the i-th row and j-th column of the base matrix, by scaling p(i, j) proportionally:

$$p(f,i,j)=\begin{cases}p(i,j), & p(i,j)\le 0\\ \left\lfloor \dfrac{p(i,j)\,z_f}{z_0}\right\rfloor, & p(i,j)>0\end{cases}\qquad z_0=96$$

An entry p(f, i, j) = −1 indicates a z × z zero matrix, and a non-negative entry indicates a z × z permutation matrix. The permutation matrix represents a circular right shift by p(f, i, j).
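A sketch of this proportional scaling (taking z_0 = 96, the largest expansion factor, as the reference, which we believe matches the standard's convention for these rates):

```python
from math import floor

def scaled_shift(p_ij, z_f, z0=96):
    """Scale a base-matrix entry p(i,j) to the shift size p(f,i,j) for
    expansion factor z_f; non-positive entries are kept unchanged."""
    if p_ij <= 0:
        return p_ij                      # -1 (zero matrix) or 0 (identity)
    return floor(p_ij * z_f / z0)        # proportional scaling
```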

Figure A.1 Base matrix of the rate 1/2 code

Rate 2/3 A code:

Figure A.2 Base matrix of the rate 2/3, type A code

Rate 2/3 B code:

Figure A.3 Base matrix of the rate 2/3, type B code
