
JOINT CODE-ENCODER-DECODER DESIGN FOR LDPC CODING SYSTEM VLSI IMPLEMENTATION

Hao Zhong and Tong Zhang

Electrical, Computer and Systems Engineering Department Rensselaer Polytechnic Institute, USA

ABSTRACT

This paper presents a design approach for low-density parity-check (LDPC) coding system hardware implementation by jointly conceiving irregular LDPC code construction and VLSI implementations of the encoder and decoder. The key idea is to construct good irregular LDPC codes subject to two constraints that ensure effective LDPC encoder and decoder hardware implementations. We propose a heuristic algorithm to construct such implementation-aware irregular LDPC codes that can achieve very good error correction performance. The encoder and decoder hardware architectures are correspondingly presented.

1. INTRODUCTION

Low-density parity-check (LDPC) codes have received much attention because of their excellent error-correcting performance and highly parallelizable decoding algorithm. However, effective VLSI implementation of the LDPC encoder and decoder remains a big challenge and a crucial issue in determining how well we can exploit the attractive merits of LDPC codes in real applications.

It has been well recognized that the conventional code to encoder/decoder design strategy (i.e., first construct a code exclusively optimized for error-correcting performance, then implement the encoder and decoder for that code) is not applicable to LDPC coding system implementations. Consequently, joint design becomes key in most recent work [1–5]. However, two challenges still remain largely unsolved: (1) complexity reduction and effective VLSI architecture design for the LDPC encoder remain largely unexplored; (2) given a desired node degree distribution, no systematic method has been proposed to construct the code for hardware implementation. Current practice largely relies on handcraft, e.g., the code template presented in [2].

In this paper, we propose a joint code-encoder-decoder de- sign for irregular LDPC codes to tackle the above two challenges.

The key is implementation-aware irregular LDPC code construction subject to two constraints that ensure effective encoder and decoder hardware implementation. A heuristic algorithm inspired by rules of thumb for constructing good LDPC codes is proposed to construct the code. Encoder and decoder hardware architectures are correspondingly presented. To the best of our knowledge, this is the first complete solution for LDPC coding system implementation in the open literature.

2. BACKGROUND

In this section, we summarize some important facts and state of the art in LDPC code construction and encoder/decoder design, which directly inspired the joint design solution proposed in this paper.

LDPC Code Construction: To achieve good performance, LDPC codes should have the following properties: (a) Large code length: the performance improves as the code length increases, and the code length cannot be too small (at least 1K); (b) Not too many small cycles: too many small cycles in the code bipartite graph will seriously degrade the error-correcting performance; (c) Irregular node degree distribution: it has been well demonstrated that carefully designed LDPC codes with irregular node degree distributions remarkably outperform regular ones.

LDPC Encoder: The straightforward encoding process using the generator matrix results in prohibitive VLSI implementation complexity. Richardson and Urbanke [6] demonstrated that, if the parity check matrix is approximate upper triangular, the encoding complexity can be significantly reduced. However, the encoding algorithm in [6] suffers from extensive usage of back-substitution operations that increase the encoding latency and make effective hardware implementation problematic. The authors of [4] showed that all the back-substitution operations can be replaced by a few matrix-vector multiplications if the approximate upper triangular parity check matrix has the form shown in Fig. 1, where I1 and I2 are identity matrices and O is a zero matrix.

[Fig. 1 shows the approximate upper triangular parity check matrix: identity matrices I1 and I2, a zero matrix O, and a gap of height g.]

Fig. 1. The encoder-aware parity check matrix structure.

LDPC Decoder: Most recently proposed LDPC decoder design schemes share the same property: the parity check matrix is a block structured matrix that can be partitioned into an array of square block matrices, each of which is either a zero matrix or a cyclic shift of an identity matrix. Such a block structured parity check matrix directly leads to effective decoder hardware implementations.
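As an illustration of this block structure (a sketch of my own, not code from the paper), such a parity check matrix can be expanded from a small base matrix of cyclic-shift values, with −1 marking a zero block; the function name below is hypothetical:

```python
import numpy as np

def expand_block_matrix(shifts, p):
    """Expand a base matrix of shift values into a block structured
    parity check matrix: entry s >= 0 becomes a p x p identity matrix
    right-cyclic-shifted by s; entry -1 becomes a p x p zero block."""
    m, n = shifts.shape
    H = np.zeros((m * p, n * p), dtype=int)
    for i in range(m):
        for j in range(n):
            if shifts[i, j] >= 0:
                # right cyclic shift: roll the identity's columns by s
                H[i*p:(i+1)*p, j*p:(j+1)*p] = np.roll(
                    np.eye(p, dtype=int), shifts[i, j], axis=1)
    return H

# a 2 x 3 base matrix expanded with p = 4
base = np.array([[0, 2, -1],
                 [1, -1, 3]])
H = expand_block_matrix(base, p=4)
print(H.shape)  # (8, 12)
```

Because every non-zero block is a permuted identity, each block contributes exactly one 1 per row and per column, so row and column weights of H are just the base-matrix row and column weights.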

3. PROPOSED JOINT DESIGN APPROACH

Motivated by the above summarized state of the art, we propose a joint code-encoder-decoder design as a complete solution for LDPC coding system implementations. In the following, we first present an implementation-aware code construction approach, then present the corresponding encoder and decoder design and hardware architectures.


3.1. Implementation-Aware Irregular Code Construction

The basic idea is to build the parity check matrix of an irregular LDPC code subject to two constraints: (1) it has an approximate upper triangular form as shown in Fig. 1 with g as small as possible; (2) it is a block structured matrix. These two constraints ensure effective encoder and decoder hardware implementations.

The design challenge is how to construct good LDPC codes under the above two constraints. This can be formulated as: given the code construction parameters, i.e., the size of the parity check matrix, the size of each block matrix, the node degree distribution¹, and the expected value of g, how do we construct a good LDPC code? We present an approach to tackle this design challenge as follows.

Firstly, we note that, for irregular LDPC codes, the variable nodes with high degree tend to converge more quickly than those with low degree. Therefore, with a finite number of decoding iterations, not all the small cycles in the code bipartite graph are equally harmful: those small cycles passing through too many low-degree variable nodes degrade the performance more seriously than the others. Thus, it is intuitive that we should prevent small cycles from passing through too many low-degree variable nodes. To this end, we introduce the concept of cycle degree:

Definition 3.1 We define the sum of degrees of all the variable nodes on a cycle as the cycle degree of this cycle.

It is intuitively desirable to make the cycle degree as large as possible for those unavoidable small cycles. Motivated by this intuition, we propose an algorithm, called Heuristic Block Padding (HBP), to construct LDPC codes subject to the above two structural constraints, i.e., the parity check matrix has the structure shown in Fig. 2. The algorithm is described as follows:

Code construction parameters: The size of each block matrix is p × p, the size of the parity check matrix is (m·p) × (n·p), and g = γ·p. The row and column weight distributions are {w_1^(r), w_2^(r), ..., w_m^(r)} and {w_1^(c), w_2^(c), ..., w_n^(c)}, where w_i^(r) and w_j^(c) represent the weights of the i-th block row and the j-th block column, respectively.

Output: An (m·p) × (n·p) parity check matrix H with the structure shown in Fig. 2, in which each p × p block matrix H_{i,j} is either a zero matrix or a right cyclic shift of an identity matrix.

Procedure:

1. Generate an (m·p) × (n·p) matrix with the structure shown in Fig. 2, where I1 and I2 are identity matrices of roughly the same size and O is a zero matrix. All the blocks in the un-shaded region are initially set as NULL blocks.

2. According to the column weight distribution, generate a set {a_1, a_2, ..., a_n}, in which a_j = w_j^(c) if 1 ≤ j ≤ n − m + γ, and a_j = w_j^(c) − 1 if n − m + γ + 1 ≤ j ≤ n.

3. According to the row weight distribution, generate a set {b_1, b_2, ..., b_m}, in which b_i = w_i^(r) − 1 if 1 ≤ i ≤ m − γ, and b_i = w_i^(r) if m − γ + 1 ≤ i ≤ m.

4. Initialize the cycle degree constraint d = d_init.

5. For j = 1 to n, replace a_j NULL blocks in the j-th block column with a_j right cyclic shifted identity matrices:

(a) Randomly pick i ∈ {1, 2, ..., m} such that b_i > 0 and H_{i,j} is a NULL block. Replace H_{i,j} with a right cyclic shift of a p × p identity matrix with a randomly generated shift value.

¹ Note that the node degree distribution is equivalent to the parity check matrix row and column weight distribution. Good distributions can be obtained using density evolution [7].

(b) Let f(H) denote the minimum cycle degree in the bipartite graph corresponding to the current matrix H. If f(H) < d or the bipartite graph contains 4-cycles, reject the replacement and go back to (a). If f(H) remains less than d after a certain number of iterations, decrease d by one before going back to (a).

(c) b_i = b_i − 1.

(d) Terminate and restart the procedure if d < d_min, where d_min is the minimum allowable cycle degree.

6. Replace all the remaining NULL blocks with zero matrices and output the matrix H.
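Steps 1–6 can be sketched in Python. The sketch below is my own simplified illustration, not the authors' implementation: it keeps only the block-row budgets b_i and the 4-cycle rejection of step 5(b), using the standard shift-value condition for quasi-cyclic matrices, and omits the cycle degree bookkeeping (f(H) and the adaptive threshold d); all function names are hypothetical.

```python
import random

def creates_4cycle(shifts, i, j, p):
    """A 4-cycle over block positions (i,j), (i,j2), (i2,j), (i2,j2)
    exists among cyclic-shifted identity blocks iff
    (d[i][j] - d[i][j2] + d[i2][j2] - d[i2][j]) % p == 0."""
    m, n = len(shifts), len(shifts[0])
    for i2 in range(m):
        if i2 == i:
            continue
        for j2 in range(n):
            if j2 == j:
                continue
            if None in (shifts[i][j2], shifts[i2][j], shifts[i2][j2]):
                continue
            if (shifts[i][j] - shifts[i][j2]
                    + shifts[i2][j2] - shifts[i2][j]) % p == 0:
                return True
    return False

def hbp_sketch(m, n, p, col_weight, row_budget, max_tries=200, seed=0):
    """Place col_weight[j] shifted-identity blocks in each block column,
    drawing block rows with remaining budget at random and rejecting
    placements that close a 4-cycle (step 5(b), simplified)."""
    rng = random.Random(seed)
    shifts = [[None] * n for _ in range(m)]   # None marks a NULL/zero block
    b = list(row_budget)
    for j in range(n):
        for _ in range(col_weight[j]):
            for _ in range(max_tries):
                i = rng.choice([r for r in range(m)
                                if b[r] > 0 and shifts[r][j] is None])
                shifts[i][j] = rng.randrange(p)
                if creates_4cycle(shifts, i, j, p):
                    shifts[i][j] = None       # reject, retry with a new pick
                else:
                    b[i] -= 1
                    break
            else:
                raise RuntimeError("stuck: restart with another seed")
    return shifts
```

A call such as `hbp_sketch(4, 8, 16, [2]*8, [8]*4)` returns a 4 × 8 base matrix of shift values whose expanded graph is free of 4-cycles; the full HBP algorithm additionally tracks f(H) against d and restarts when d falls below d_min.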

[Fig. 2 shows the (m·p) × (n·p) parity check matrix H as an array of p × p blocks H_{1,1}, ..., H_{m,n}. The shaded right-hand region of width m·p contains the identity matrices I1 and I2 and the zero matrix O forming the approximate upper triangular structure with gap g = γ·p; the left-hand region of width (n − m)·p holds the blocks placed by the HBP algorithm.]

Fig. 2. The parity check matrix H.

3.2. LDPC Encoder Design

In the following, we present an encoder design by exploiting the structural property of the code parity check matrix. We first describe an encoding process, which is similar to that presented in [6] but does not contain any back-substitution operations. Then we present the encoder hardware architecture design.

Encoding Process: According to Fig. 2, we can write the parity check matrix² as

    H = [ A  B  T ]
        [ C  D  E ] ,    (1)

where A is (m·p − g) × ((n − m)·p), B is (m·p − g) × g, the upper triangular matrix T is (m·p − g) × (m·p − g), C is g × ((n − m)·p), D is g × g, and E is g × (m·p − g). Let [z1, z2, z3] be a codeword decomposed according to (1), where z1 is the information bit vector of length (n − m)·p, and the redundant parity check bit vectors z2 and z3 have lengths g and m·p − g, respectively. Because of the structural property of the binary upper triangular matrix T, we can prove T = T⁻¹. Fig. 3 shows the encoding flow diagram, where Φ = −ETB + D.

In the encoding process, except for the multiplication by Φ⁻¹, all the other steps perform multiplication between a sparse matrix and a vector. Although the complexity of the multiplication by Φ⁻¹ scales with g², the value of g can be very small compared to the matrix size. Thus the overall computational complexity of the encoding is much less than that of encoding based on the generator matrix.

² We assume that the parity check matrix is full rank, i.e., the m·p rows are linearly independent. In our computer simulation, all the matrices constructed using the above HBP algorithm are full rank.
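The encoding equations can be checked numerically. The sketch below is my own GF(2) model with small random matrices (not the paper's code); it uses T⁻¹ explicitly rather than relying on the T = T⁻¹ property, and verifies that the resulting [z1, z2, z3] satisfies H·zᵀ = 0:

```python
import numpy as np

def gf2_solve(M, v):
    """Solve M x = v over GF(2) by Gauss-Jordan elimination;
    returns None if M is singular."""
    M, v = M.copy() % 2, v.copy() % 2
    n = M.shape[0]
    for col in range(n):
        pivots = [r for r in range(col, n) if M[r, col]]
        if not pivots:
            return None
        r = pivots[0]
        M[[col, r]], v[[col, r]] = M[[r, col]], v[[r, col]]
        for r2 in range(n):
            if r2 != col and M[r2, col]:
                M[r2] ^= M[col]
                v[r2] ^= v[col]
    return v

def gf2_inv(M):
    cols = [gf2_solve(M, e) for e in np.eye(M.shape[0], dtype=int)]
    return None if any(c is None for c in cols) else np.stack(cols, axis=1)

rng = np.random.default_rng(0)
k, g, t = 8, 3, 5           # toy sizes: len(z1), len(z2), len(z3)
while True:                 # retry until the g x g matrix Phi is invertible
    A = rng.integers(0, 2, (t, k)); B = rng.integers(0, 2, (t, g))
    C = rng.integers(0, 2, (g, k)); D = rng.integers(0, 2, (g, g))
    E = rng.integers(0, 2, (g, t))
    # unit-diagonal upper triangular T is always invertible over GF(2)
    T = np.triu(rng.integers(0, 2, (t, t)), 1) + np.eye(t, dtype=int)
    Tinv = gf2_inv(T)
    Phi = (E @ Tinv @ B + D) % 2
    Phi_inv = gf2_inv(Phi)
    if Phi_inv is not None:
        break

z1 = rng.integers(0, 2, k)
z2 = Phi_inv @ ((E @ Tinv @ A @ z1 + C @ z1) % 2) % 2
z3 = Tinv @ ((A @ z1 + B @ z2) % 2) % 2

H = np.block([[A, B, T], [C, D, E]])
z = np.concatenate([z1, z2, z3])
assert not ((H @ z) % 2).any()   # [z1, z2, z3] is a valid codeword
```

Over GF(2) the sign in Φ = −ETB + D can be dropped, since subtraction and addition coincide modulo 2.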


[Fig. 3 depicts the 6-stage pipelined encoding flow: z1ᵀ is multiplied by A and by C; the A-branch passes through T and E; the two branches are added (XOR) and multiplied by Φ⁻¹ to produce z2ᵀ = Φ⁻¹[E T A z1ᵀ + C z1ᵀ]; finally z3ᵀ = T[A z1ᵀ + B z2ᵀ].]

Fig. 3. Flow diagram of encoding process.

Encoder Architecture: The above encoding process mainly consists of six large sparse matrix-vector multiplications and one small dense matrix-vector multiplication. Directly mapping these large sparse matrix-vector multiplications to silicon can achieve very high speed but will suffer from significant logic gate and interconnection complexities.

Leveraging the structural property of the parity check matrix, we propose an approach to trade speed for complexity reduction in the implementation of such large sparse matrix-vector multiplications. Since each large sparse matrix is block structured, the matrix-vector multiplications can be written as:

    ⎡ U_{1,1} U_{1,2} ... U_{1,s} ⎤ ⎡ x_1 ⎤   ⎡ y_1 ⎤
    ⎢ U_{2,1} U_{2,2} ... U_{2,s} ⎥ ⎢ x_2 ⎥ = ⎢ y_2 ⎥
    ⎢   ...     ...   ...   ...   ⎥ ⎢ ... ⎥   ⎢ ... ⎥
    ⎣ U_{t,1} U_{t,2} ... U_{t,s} ⎦ ⎣ x_s ⎦   ⎣ y_t ⎦ ,    (2)

where each p × p block matrix U_{i,j} is either a zero matrix or a right cyclic shift of an identity matrix, and each x_j and y_i is a p × 1 vector. Let the column and row weight distributions of matrix U be {q_1, q_2, ..., q_s} and {r_1, r_2, ..., r_t}, where q_j and r_i represent the weights of the j-th block column and the i-th block row.

To trade speed for complexity reduction, we propose to perform such large sparse matrix-vector multiplications in an inter-vector-parallel/intra-vector-serial fashion: compute all t vectors y_1, y_2, ..., y_t in parallel, but only 1 bit of each vector at a time. Define the set P = {(i, j) | U_{i,j} is non-zero}. Since each non-zero U_{i,j} is a right cyclic shift of an identity matrix, we have y_i = Σ_{(i,j)∈P} x_j[↑ d_{i,j}], where d_{i,j} is the right cyclic shift value of U_{i,j} and x_j[↑ d_{i,j}] represents cyclically shifting the vector x_j up by d_{i,j} positions. To reduce the implementation complexity, we compute each vector y_i bit by bit via sharing the same computational resource, i.e., an r_i-input XOR tree.

Fig. 4 shows a hardware architecture implementing the sparse matrix-vector multiplication in this inter-vector-parallel/intra-vector-serial fashion. Each input vector x_j and output vector y_i is stored in memory X_j and Y_i, respectively. The entire matrix-vector multiplication is completed in p clock cycles; each clock cycle computes t bits at the same position in the t vectors y_1, y_2, ..., y_t.

[Fig. 4 shows the datapath: input memory banks X_1, ..., X_s (p bits each) are read through binary-counter address generators AG_{1,j}, ..., AG_{q_j,j} (log2(p) bits each), routed over hardwired interconnections to XOR trees (an r_i-input tree producing 1 bit per cycle per output), which write into output memory banks Y_1, ..., Y_t (p bits each).]

Fig. 4. Hardware design for sparse matrix-vector multiplication.

This demands that the memory banks X_j provide |P| bits at the same position in the |P| vectors {x_j[↑ d_{i,j}] | ∀(i, j) ∈ P}. To fulfill this requirement, each X_j provides q_j 1-bit outputs with addresses generated by q_j address generators AG_{1,j}, ..., AG_{q_j,j}. Each address generator AG_{k,j} is simply a binary counter initialized with a distinct value in {d_{i,j} | ∀(i, j) ∈ P}.

As illustrated in Fig. 3, the encoding is realized with 6-stage pipelining: the encoder contains six inter-vector-parallel/intra-vector-serial sparse matrix-vector multiplication blocks and one dense matrix-vector multiplication block that is directly mapped to silicon after logic minimization. To support the pipelining, we double the size of the input memory banks in each sparse matrix-vector multiplication block, i.e., two sets of input memory banks alternately receive the output from the previous stage and provide the data for the current computation.

To estimate the encoder logic gate complexity in terms of the number of 2-input NAND gates, we count each 2-input XOR gate as three 2-input NAND gates and each l-bit binary counter as 8l 2-input NAND gates. Assume the number of non-zero block matrices in sub-matrix T is 2m and the small dense matrix-vector multiplication can be realized using g²/6 2-input XOR gates. Let f_E denote the clock frequency of the encoder. We estimate the key metrics of this 6-stage pipelined encoder as follows:

User Data Rate: (n − m) · f_E
Memory (bits):  (2n + m) · p + 3g
# of Gates:     3·|P| + g²/2 + 8·log2(p)·|P|

3.3. LDPC Decoder Design

The LDPC code constructed above, whose parity check matrix has the structure shown in Fig. 2, directly fits the decoder architecture illustrated in Fig. 5. It contains m check node computation units (CNUs) and n variable node computation units (VNUs), which perform all the node computations in a time-division multiplexing fashion. The decoder uses n memory blocks to store the n·p channel input messages and |P| memory blocks to store all the decoding messages; recall that |P| is the total number of non-zero block matrices.


[Fig. 5 shows the decoder: |P| + n memory blocks connected to m check node units CNU_1, ..., CNU_m and n variable node units VNU_1, ..., VNU_n.]

Fig. 5. Decoder architecture.

The message passing between variable and check nodes is jointly realized by memory addressing and hardwired interconnection between memory blocks and node computation units. Since each non-zero block matrix is a right cyclic shift of an identity matrix, the access address for each memory block can simply be generated by a binary counter. We note that this design strategy shares the same basic idea with state-of-the-art decoder designs [1–3].

Given each decoding message quantized to q bits, we estimate that each CNU and VNU requires 320·q and 250·q gates (in terms of 2-input NAND gates), respectively. Let f_D denote the clock frequency of the decoder and D_avg the average number of decoding iterations. We estimate the key metrics of the decoder as:

User Data Rate: (n − m) · f_D / (2·D_avg)
Memory (bits):  (n + |P|) · p · q
# of Gates:     (320m + 250n) · q

4. AN EXAMPLE

Applying our proposed HBP algorithm, we constructed a rate-1/2, 8K irregular LDPC code. The column weights are 2, 3, 4, and 5, and the row weights are 6 and 7. Let m = 64, n = 128, p = 64, and γ = 3, so each block matrix is 64 × 64 and g = γ·p = 192. When constructing the code using the HBP algorithm, we set the minimum allowable cycle degree d_min = 8. We simulated the error-correcting performance assuming BPSK modulation and transmission over an AWGN channel.

[Fig. 6(a): BER and FER versus Eb/N0 (dB) over 1–1.4 dB, spanning 10⁰ down to 10⁻⁶; Fig. 6(b): average number of iterations versus Eb/N0 over the same range.]

Fig. 6. Simulation results.

Fig. 6 shows the simulated bit error rate (BER), frame error rate (FER), and the average number of iterations. We note that such error-correcting performance is better than or comparable to published results in the open literature.

The parity check matrix of the constructed rate-1/2, 8K code contains 404 non-zero block matrices. Denote the clock frequencies of the encoder and decoder as f_E and f_D, respectively. Suppose each decoding message is quantized to 4 bits and the average number of iterations is 20. Based on the key metrics estimations of the encoder and decoder listed in Sections 3.2 and 3.3, we have the following estimated key metrics of the coding system implementation for this rate-1/2, 8K code:

LDPC      User Data Rate   Memory (bits)   # of Gates
Encoder   64 · f_E         21K             38K
Decoder   1.6 · f_D        133K            205K
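These figures follow from the formulas in Sections 3.2 and 3.3 with m = 64, n = 128, p = 64, g = 192, |P| = 404, q = 4, and D_avg = 20. A quick arithmetic check of my own (with K = 1024):

```python
from math import log2

m, n, p, g = 64, 128, 64, 192
P, q, D_avg = 404, 4, 20      # |P| non-zero blocks, 4-bit messages, 20 iterations

# Encoder (Section 3.2)
enc_rate_factor = n - m                       # user data rate = 64 * f_E
enc_mem = (2 * n + m) * p + 3 * g             # bits
enc_gates = 3 * P + g * g // 2 + 8 * int(log2(p)) * P

# Decoder (Section 3.3)
dec_rate_factor = (n - m) / (2 * D_avg)       # user data rate = 1.6 * f_D
dec_mem = (n + P) * p * q
dec_gates = (320 * m + 250 * n) * q

print(enc_rate_factor, round(enc_mem / 1024), round(enc_gates / 1024))  # 64 21 38
print(dec_rate_factor, round(dec_mem / 1024), round(dec_gates / 1024))  # 1.6 133 205
```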

5. CONCLUSION

In this paper, we presented a joint code-encoder-decoder design approach for practical LDPC coding system hardware implementations. The basic idea is implementation-aware LDPC code design, which constructs an irregular LDPC code subject to two constraints that ensure effective LDPC encoder and decoder hardware implementations. A heuristic algorithm has been developed to perform the code construction while optimizing the error correction performance. The efficient encoding process was described and a pipelined encoder hardware architecture was developed. The decoder hardware architecture was also presented. The proposed approach for the first time provides a complete systematic solution for LDPC coding system hardware implementation.

6. REFERENCES

[1] M. M. Mansour and N. R. Shanbhag, “A novel design methodology for high-performance programmable decoder cores for AA-LDPC codes,” in IEEE Workshop on Signal Processing Systems (SiPS), Seoul, Korea, August 2003.

[2] D. E. Hocevar, “LDPC code construction with flexible hardware implementation,” in IEEE International Conference on Communications, 2003, pp. 2708–2712.

[3] Y. Chen and D. Hocevar, “An FPGA and ASIC implementation of rate 1/2 8088-b irregular low density parity check decoder,” in Proc. of Globecom, 2003.

[4] T. Zhang and K. K. Parhi, “Joint (3, k)-regular LDPC code and decoder/encoder design,” to appear in IEEE Transactions on Signal Processing, 2003.

[5] E. Yeo, B. Nikolic, and V. Anantharam, “Architectures and implementation of low-density parity-check decoding algorithms,” in 45th IEEE Midwest Symposium on Circuits and Systems, August 2002, pp. 437–440.

[6] T. Richardson and R. Urbanke, “Efficient encoding of low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 638–656, Feb. 2001.

[7] T. Richardson, A. Shokrollahi, and R. Urbanke, “Design of capacity-approaching low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, Feb. 2001.
