
JOINT CODE-ENCODER-DECODER DESIGN FOR LDPC CODING SYSTEM VLSI IMPLEMENTATION

Hao Zhong and Tong Zhang

Electrical, Computer and Systems Engineering Department Rensselaer Polytechnic Institute, USA

ABSTRACT

This paper presents a design approach for low-density parity-check (LDPC) coding system hardware implementation by jointly conceiving irregular LDPC code construction and VLSI implementations of the encoder and decoder. The key idea is to construct good irregular LDPC codes subject to two constraints that ensure effective LDPC encoder and decoder hardware implementations. We propose a heuristic algorithm to construct such implementation-aware irregular LDPC codes that can achieve very good error correction performance. The encoder and decoder hardware architectures are correspondingly presented.

1. INTRODUCTION

Low-density parity-check (LDPC) codes have received much attention because of their excellent error-correcting performance and highly parallelizable decoding algorithm. However, effective VLSI implementation of the LDPC encoder and decoder remains a big challenge and a crucial issue in determining how well we can exploit the attractive merits of LDPC codes in real applications.

It has been well recognized that the conventional code to encoder/decoder design strategy (i.e., first construct a code exclusively optimized for error-correcting performance, then implement the encoder and decoder for that code) is not applicable to LDPC coding system implementations. Consequently, joint design becomes key in most recent work [1–5]. However, two challenges still remain largely unsolved: (1) complexity reduction and effective VLSI architecture design for the LDPC encoder remain largely unexplored; (2) given a desired node degree distribution, no systematic method has been proposed to construct the code for hardware implementation. Current practice largely relies on handcraft, e.g., the code template presented in [2].

In this paper, we propose a joint code-encoder-decoder de- sign for irregular LDPC codes to tackle the above two challenges.

The key is implementation-aware irregular LDPC code construction subject to two constraints that ensure effective encoder and decoder hardware implementation. A heuristic algorithm inspired by rules of thumb for constructing good LDPC codes is proposed to construct the code. Encoder and decoder hardware architectures are correspondingly presented. To the best of our knowledge, this is the first complete solution for LDPC coding system implementation in the open literature.

2. BACKGROUND

In this section, we summarize some important facts and state of the art in LDPC code construction and encoder/decoder design, which directly inspired the joint design solution proposed in this paper.

LDPC Code Construction: To achieve good performance, LDPC codes should have the following properties: (a) Large code length: the performance improves as the code length increases, and the code length cannot be too small (at least 1K); (b) Not too many small cycles: too many small cycles in the code bipartite graph will seriously degrade the error-correcting performance; (c) Irregular node degree distribution: it has been well demonstrated that carefully designed LDPC codes with irregular node degree distributions remarkably outperform regular ones.

LDPC Encoder: The straightforward encoding process using the generator matrix results in prohibitive VLSI implementation complexity. Richardson and Urbanke [6] demonstrated that, if the parity check matrix is approximate upper triangular, the encoding complexity can be significantly reduced. However, the encoding algorithm in [6] suffers from extensive usage of back-substitution operations that increase the encoding latency and make effective hardware implementation problematic. The authors of [4] showed that all the back-substitution operations can be replaced by a few matrix-vector multiplications if the approximate upper triangular parity check matrix has the form shown in Fig. 1, where I1 and I2 are identity matrices and O is a zero matrix.

[Fig. 1 shows the approximate upper triangular parity check matrix: identity matrices I1 and I2, a zero matrix O, and a gap of height g.]

Fig. 1. The encoder-aware parity check matrix structure.

LDPC Decoder: Most recently proposed LDPC decoder design schemes share the same property: the parity check matrix is a block structured matrix that can be partitioned into an array of square block matrices, each of which is either a zero matrix or a cyclic shift of an identity matrix. Such a block structured parity check matrix directly leads to effective decoder hardware implementations.
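As an illustration of this block structure (a sketch of my own, not code from the paper), such a parity check matrix can be expanded from a small base matrix of cyclic-shift values, with −1 marking a zero block; the function name below is hypothetical:

```python
import numpy as np

def expand_block_matrix(shifts, p):
    """Expand a base matrix of shift values into a block structured
    parity check matrix: entry s >= 0 becomes a p x p identity matrix
    right-cyclic-shifted by s; entry -1 becomes a p x p zero block."""
    m, n = shifts.shape
    H = np.zeros((m * p, n * p), dtype=int)
    for i in range(m):
        for j in range(n):
            if shifts[i, j] >= 0:
                # right cyclic shift: roll the identity's columns by s
                H[i*p:(i+1)*p, j*p:(j+1)*p] = np.roll(
                    np.eye(p, dtype=int), shifts[i, j], axis=1)
    return H

# a 2 x 3 base matrix expanded with p = 4
base = np.array([[0, 2, -1],
                 [1, -1, 3]])
H = expand_block_matrix(base, p=4)
print(H.shape)  # (8, 12)
```

Because every non-zero block is a permuted identity, each block contributes exactly one 1 per row and per column, so row and column weights of H are just the base-matrix row and column weights.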

3. PROPOSED JOINT DESIGN APPROACH

Motivated by the above summarized state of the art, we propose a joint code-encoder-decoder design as a complete solution for LDPC coding system implementations. In the following, we first present an implementation-aware code construction approach, then present the corresponding encoder and decoder design and hardware architectures.


3.1. Implementation-Aware Irregular Code Construction

The basic idea is to build the parity check matrix of an irregular LDPC code subject to two constraints: (1) it has an approximate upper triangular form as shown in Fig. 1 with g as small as possible; (2) it is a block structured matrix. These two constraints ensure effective encoder and decoder hardware implementations.

The design challenge is how to construct good LDPC codes under the above two constraints. This can be formulated as: given the code construction parameters, i.e., the size of the parity check matrix, the size of each block matrix, the node degree distribution¹, and the expected value of g, how do we construct a good LDPC code? We present an approach to tackle this design challenge as follows.

Firstly, we note that, for irregular LDPC codes, the variable nodes with high degree tend to converge more quickly than those with low degree. Therefore, with a finite number of decoding iterations, not all the small cycles in the code bipartite graph are equally harmful: those small cycles passing through too many low-degree variable nodes degrade the performance more seriously than the others. Thus, it is intuitive that we should prevent small cycles from passing through too many low-degree variable nodes. To this end, we introduce the concept of cycle degree:

Definition 3.1 We define the sum of degrees of all the variable nodes on a cycle as the cycle degree of this cycle.

It is intuitively desirable to make the cycle degree as large as possible for those unavoidable small cycles. Motivated by this intuition, we propose an algorithm, called Heuristic Block Padding (HBP), to construct LDPC codes subject to the above two structural constraints, i.e., the parity check matrix has the structure shown in Fig. 2. The algorithm is described as follows:

Code construction parameters: The size of each block matrix is p × p, the size of the parity check matrix is (m·p) × (n·p), and g = γ·p. The row and column weight distributions are {w_1^(r), w_2^(r), ..., w_m^(r)} and {w_1^(c), w_2^(c), ..., w_n^(c)}, where w_i^(r) and w_j^(c) represent the weights of the i-th block row and the j-th block column, respectively.

Output: An (m·p) × (n·p) parity check matrix H with the structure shown in Fig. 2, in which each p × p block matrix H_{i,j} is either a zero matrix or a right cyclic shift of an identity matrix.

Procedure:

1. Generate an (m·p) × (n·p) matrix with the structure shown in Fig. 2, where I1 and I2 are identity matrices of roughly the same size and O is a zero matrix. All the blocks in the un-shaded region are initially set as NULL blocks.

2. According to the column weight distribution, generate a set {a_1, a_2, ..., a_n}, in which a_j = w_j^(c) if 1 ≤ j ≤ n − m + γ, and a_j = w_j^(c) − 1 if n − m + γ + 1 ≤ j ≤ n.

3. According to the row weight distribution, generate a set {b_1, b_2, ..., b_m}, in which b_i = w_i^(r) − 1 if 1 ≤ i ≤ m − γ, and b_i = w_i^(r) if m − γ + 1 ≤ i ≤ m.

4. Initialize the cycle degree constraint d = d_init.

5. For j = 1 to n, replace a_j NULL blocks in the j-th block column with a_j right cyclic shifted identity matrices:

(a) Randomly pick i ∈ {1, 2, ..., m} such that b_i > 0 and H_{i,j} is a NULL block. Replace H_{i,j} with a right cyclic shift of a p × p identity matrix with a randomly generated shift value.

¹ Note that the node degree distribution is equivalent to the parity check matrix row and column weight distribution. Good distributions can be obtained using density evolution [7].

(b) Let f(H) denote the minimum cycle degree in the bipartite graph corresponding to the current matrix H. If f(H) < d or the bipartite graph contains 4-cycles, reject the replacement and go back to (a). If f(H) remains less than d after a certain number of iterations, decrease d by one before going back to (a).

(c) b_i = b_i − 1.

(d) Terminate and restart the procedure if d < d_min, where d_min is the minimum allowable cycle degree.

6. Replace all the remaining NULL blocks with zero matrices and output the matrix H.
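Steps 1–6 can be sketched in Python. The sketch below is my own simplified illustration, not the authors' implementation: it keeps only the block-row budgets b_i and the 4-cycle rejection of step 5(b), using the standard shift-value condition for quasi-cyclic matrices, and omits the cycle degree bookkeeping (f(H) and the adaptive threshold d); all function names are hypothetical.

```python
import random

def creates_4cycle(shifts, i, j, p):
    """A 4-cycle over block positions (i,j), (i,j2), (i2,j), (i2,j2)
    exists among cyclic-shifted identity blocks iff
    (d[i][j] - d[i][j2] + d[i2][j2] - d[i2][j]) % p == 0."""
    m, n = len(shifts), len(shifts[0])
    for i2 in range(m):
        if i2 == i:
            continue
        for j2 in range(n):
            if j2 == j:
                continue
            if None in (shifts[i][j2], shifts[i2][j], shifts[i2][j2]):
                continue
            if (shifts[i][j] - shifts[i][j2]
                    + shifts[i2][j2] - shifts[i2][j]) % p == 0:
                return True
    return False

def hbp_sketch(m, n, p, col_weight, row_budget, max_tries=200, seed=0):
    """Place col_weight[j] shifted-identity blocks in each block column,
    drawing block rows with remaining budget at random and rejecting
    placements that close a 4-cycle (step 5(b), simplified)."""
    rng = random.Random(seed)
    shifts = [[None] * n for _ in range(m)]   # None marks a NULL/zero block
    b = list(row_budget)
    for j in range(n):
        for _ in range(col_weight[j]):
            for _ in range(max_tries):
                i = rng.choice([r for r in range(m)
                                if b[r] > 0 and shifts[r][j] is None])
                shifts[i][j] = rng.randrange(p)
                if creates_4cycle(shifts, i, j, p):
                    shifts[i][j] = None       # reject, retry with a new pick
                else:
                    b[i] -= 1
                    break
            else:
                raise RuntimeError("stuck: restart with another seed")
    return shifts
```

A call such as `hbp_sketch(4, 8, 16, [2]*8, [8]*4)` returns a 4 × 8 base matrix of shift values whose expanded graph is free of 4-cycles; the full HBP algorithm additionally tracks f(H) against d and restarts when d falls below d_min.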

[Fig. 2 shows the (m·p) × (n·p) parity check matrix H as an array of p × p blocks H_{1,1}, ..., H_{m,n}. The shaded right-hand region of width m·p contains the identity matrices I1 and I2 and the zero matrix O forming the approximate upper triangular structure with gap g = γ·p; the left-hand region of width (n − m)·p holds the blocks placed by the HBP algorithm.]

Fig. 2. The parity check matrix H.

3.2. LDPC Encoder Design

In the following, we present an encoder design by exploiting the structural property of the code parity check matrix. We first describe an encoding process, which is similar to that presented in [6] but does not contain any back-substitution operations. Then we present the encoder hardware architecture design.

Encoding Process: According to Fig. 2, we can write the parity check matrix² as

    H = [ A  B  T ]
        [ C  D  E ] ,    (1)

where A is (m·p − g) × ((n − m)·p), B is (m·p − g) × g, the upper triangular matrix T is (m·p − g) × (m·p − g), C is g × ((n − m)·p), D is g × g, and E is g × (m·p − g). Let [z1, z2, z3] be a codeword decomposed according to (1), where z1 is the information bit vector of length (n − m)·p, and the redundant parity check bit vectors z2 and z3 have lengths g and m·p − g, respectively. Because of the structural property of the binary upper triangular matrix T, we can prove T = T⁻¹. Fig. 3 shows the encoding flow diagram, where Φ = −ETB + D.

In the encoding process, except for the multiplication by Φ⁻¹, all the other steps perform multiplication between a sparse matrix and a vector. Although the complexity of the multiplication by Φ⁻¹ scales with g², the value of g can be very small compared to the matrix size. Thus the overall computational complexity of the encoding is much less than that of encoding based on the generator matrix.

² We assume that the parity check matrix is full rank, i.e., the m·p rows are linearly independent. In our computer simulation, all the matrices constructed using the above HBP algorithm are full rank.
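The encoding equations can be checked numerically. The sketch below is my own GF(2) model with small random matrices (not the paper's code); it uses T⁻¹ explicitly rather than relying on the T = T⁻¹ property, and verifies that the resulting [z1, z2, z3] satisfies H·zᵀ = 0:

```python
import numpy as np

def gf2_solve(M, v):
    """Solve M x = v over GF(2) by Gauss-Jordan elimination;
    returns None if M is singular."""
    M, v = M.copy() % 2, v.copy() % 2
    n = M.shape[0]
    for col in range(n):
        pivots = [r for r in range(col, n) if M[r, col]]
        if not pivots:
            return None
        r = pivots[0]
        M[[col, r]], v[[col, r]] = M[[r, col]], v[[r, col]]
        for r2 in range(n):
            if r2 != col and M[r2, col]:
                M[r2] ^= M[col]
                v[r2] ^= v[col]
    return v

def gf2_inv(M):
    cols = [gf2_solve(M, e) for e in np.eye(M.shape[0], dtype=int)]
    return None if any(c is None for c in cols) else np.stack(cols, axis=1)

rng = np.random.default_rng(0)
k, g, t = 8, 3, 5           # toy sizes: len(z1), len(z2), len(z3)
while True:                 # retry until the g x g matrix Phi is invertible
    A = rng.integers(0, 2, (t, k)); B = rng.integers(0, 2, (t, g))
    C = rng.integers(0, 2, (g, k)); D = rng.integers(0, 2, (g, g))
    E = rng.integers(0, 2, (g, t))
    # unit-diagonal upper triangular T is always invertible over GF(2)
    T = np.triu(rng.integers(0, 2, (t, t)), 1) + np.eye(t, dtype=int)
    Tinv = gf2_inv(T)
    Phi = (E @ Tinv @ B + D) % 2
    Phi_inv = gf2_inv(Phi)
    if Phi_inv is not None:
        break

z1 = rng.integers(0, 2, k)
z2 = Phi_inv @ ((E @ Tinv @ A @ z1 + C @ z1) % 2) % 2
z3 = Tinv @ ((A @ z1 + B @ z2) % 2) % 2

H = np.block([[A, B, T], [C, D, E]])
z = np.concatenate([z1, z2, z3])
assert not ((H @ z) % 2).any()   # [z1, z2, z3] is a valid codeword
```

Over GF(2) the sign in Φ = −ETB + D can be dropped, since subtraction and addition coincide modulo 2.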


[Fig. 3 depicts the 6-stage pipelined encoding flow: z1ᵀ is multiplied by A and by C; the A-branch passes through T and E; the two branches are added (XOR) and multiplied by Φ⁻¹ to produce z2ᵀ = Φ⁻¹[E T A z1ᵀ + C z1ᵀ]; finally z3ᵀ = T[A z1ᵀ + B z2ᵀ].]

Fig. 3. Flow diagram of encoding process.

Encoder Architecture: The above encoding process mainly consists of six large sparse matrix-vector multiplications and one small dense matrix-vector multiplication. Directly mapping these large sparse matrix-vector multiplications to silicon can achieve very high speed but will suffer from significant logic gate and interconnection complexities.

Leveraging the structural property of the parity check matrix, we propose an approach to trade speed for complexity reduction in the implementation of such large sparse matrix-vector multiplications. Since each large sparse matrix is block structured, the matrix-vector multiplications can be written as:

    ⎡ U_{1,1} U_{1,2} ... U_{1,s} ⎤ ⎡ x_1 ⎤   ⎡ y_1 ⎤
    ⎢ U_{2,1} U_{2,2} ... U_{2,s} ⎥ ⎢ x_2 ⎥ = ⎢ y_2 ⎥
    ⎢   ...     ...   ...   ...   ⎥ ⎢ ... ⎥   ⎢ ... ⎥
    ⎣ U_{t,1} U_{t,2} ... U_{t,s} ⎦ ⎣ x_s ⎦   ⎣ y_t ⎦ ,    (2)

where each p × p block matrix U_{i,j} is either a zero matrix or a right cyclic shift of an identity matrix, and each x_j and y_i is a p × 1 vector. Let the column and row weight distributions of matrix U be {q_1, q_2, ..., q_s} and {r_1, r_2, ..., r_t}, where q_j and r_i represent the weights of the j-th block column and the i-th block row.

To trade speed for complexity reduction, we propose to perform such large sparse matrix-vector multiplications in an inter-vector-parallel/intra-vector-serial fashion: compute all t vectors y_1, y_2, ..., y_t in parallel, but only 1 bit of each vector at a time. Define the set P = {(i, j) | U_{i,j} is non-zero}. Since each non-zero U_{i,j} is a right cyclic shift of an identity matrix, we have y_i = Σ_{(i,j)∈P} x_j[↑ d_{i,j}], where d_{i,j} is the right cyclic shift value of U_{i,j} and x_j[↑ d_{i,j}] represents cyclically shifting the vector x_j up by d_{i,j} positions. To reduce the implementation complexity, we compute each vector y_i bit by bit via sharing the same computational resource, i.e., an r_i-input XOR tree.

Fig. 4 shows a hardware architecture implementing the sparse matrix-vector multiplication in this inter-vector-parallel/intra-vector-serial fashion. Each input vector x_j and output vector y_i is stored in memory X_j and Y_i, respectively. The entire matrix-vector multiplication is completed in p clock cycles; each clock cycle computes t bits at the same position in the t vectors y_1, y_2, ..., y_t.

[Fig. 4 shows the datapath: input memory banks X_1, ..., X_s (p bits each) are read through binary-counter address generators AG_{1,j}, ..., AG_{q_j,j} (log2(p) bits each), routed over hardwired interconnections to XOR trees (an r_i-input tree producing 1 bit per cycle per output), which write into output memory banks Y_1, ..., Y_t (p bits each).]

Fig. 4. Hardware design for sparse matrix-vector multiplication.

This demands that the memory banks X_j provide |P| bits at the same position in the |P| vectors {x_j[↑ d_{i,j}] | ∀(i, j) ∈ P}. To fulfill this requirement, each X_j provides q_j 1-bit outputs with addresses generated by q_j address generators AG_{1,j}, ..., AG_{q_j,j}. Each address generator AG_{k,j} is simply a binary counter initialized with a distinct value in {d_{i,j} | ∀(i, j) ∈ P}.

As illustrated in Fig. 3, the encoding is realized with 6-stage pipelining: the encoder contains six inter-vector-parallel/intra-vector-serial sparse matrix-vector multiplication blocks and one dense matrix-vector multiplication block that is directly mapped to silicon after logic minimization. To support the pipelining, we double the size of the input memory banks in each sparse matrix-vector multiplication block, i.e., two sets of input memory banks alternately receive the output from the previous stage and provide the data for the current computation.

To estimate the encoder logic gate complexity in terms of the number of 2-input NAND gates, we count each 2-input XOR gate as three 2-input NAND gates and each l-bit binary counter as 8l 2-input NAND gates. Assume the number of non-zero block matrices in sub-matrix T is 2m and the small dense matrix-vector multiplication can be realized using g²/6 2-input XOR gates. Let f_E denote the clock frequency of the encoder. We estimate the key metrics of this 6-stage pipelined encoder as follows:

User Data Rate: (n − m) · f_E
Memory (bits):  (2n + m) · p + 3g
# of Gates:     3·|P| + g²/2 + 8·log2(p)·|P|

3.3. LDPC Decoder Design

The LDPC code constructed above, whose parity check matrix has the structure shown in Fig. 2, directly fits the decoder architecture illustrated in Fig. 5. It contains m check node computation units (CNUs) and n variable node computation units (VNUs), which perform all the node computations in a time-division multiplexing fashion. The decoder uses n memory blocks to store the n·p channel input messages and |P| memory blocks to store all the decoding messages; recall that |P| is the total number of non-zero block matrices.


[Fig. 5 shows the decoder: |P| + n memory blocks connected to m check node units CNU_1, ..., CNU_m and n variable node units VNU_1, ..., VNU_n.]

Fig. 5. Decoder architecture.

The message passing between variable and check nodes is jointly realized by memory addressing and hardwired interconnection between memory blocks and node computation units. Since each non-zero block matrix is a right cyclic shift of an identity matrix, the access address for each memory block can simply be generated by a binary counter. We note that this design strategy shares the same basic idea with state-of-the-art decoder designs [1–3].

Given each decoding message quantized to q bits, we estimate that each CNU and VNU requires 320·q and 250·q gates (in terms of 2-input NAND gates), respectively. Let f_D denote the clock frequency of the decoder and D_avg the average number of decoding iterations. We estimate the key metrics of the decoder as:

User Data Rate: (n − m) · f_D / (2·D_avg)
Memory (bits):  (n + |P|) · p · q
# of Gates:     (320m + 250n) · q

4. AN EXAMPLE

Applying our proposed HBP algorithm, we constructed a rate-1/2, 8K irregular LDPC code. The column weights are 2, 3, 4, and 5, and the row weights are 6 and 7. Let m = 64, n = 128, p = 64, and γ = 3, so each block matrix is 64 × 64 and g = γ·p = 192. When constructing the code using the HBP algorithm, we set the minimum allowable cycle degree d_min = 8. We simulated the error-correcting performance assuming BPSK modulation and transmission over an AWGN channel.

[Fig. 6(a): BER and FER versus Eb/N0 (dB) over 1–1.4 dB, spanning 10⁰ down to 10⁻⁶; Fig. 6(b): average number of iterations versus Eb/N0 over the same range.]

Fig. 6. Simulation results.

Fig. 6 shows the simulated bit error rate (BER), frame error rate (FER), and the average number of iterations. We note that such error-correcting performance is better than or comparable to published results in the open literature.

The parity check matrix of the constructed rate-1/2, 8K code contains 404 non-zero block matrices. Denote the clock frequencies of the encoder and decoder as f_E and f_D, respectively. Suppose each decoding message is quantized to 4 bits and the average number of iterations is 20. Based on the key metrics estimations of the encoder and decoder listed in Sections 3.2 and 3.3, we have the following estimated key metrics of the coding system implementation for this rate-1/2, 8K code:

LDPC      User Data Rate   Memory (bits)   # of Gates
Encoder   64 · f_E         21K             38K
Decoder   1.6 · f_D        133K            205K
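These figures follow from the formulas in Sections 3.2 and 3.3 with m = 64, n = 128, p = 64, g = 192, |P| = 404, q = 4, and D_avg = 20. A quick arithmetic check of my own (with K = 1024):

```python
from math import log2

m, n, p, g = 64, 128, 64, 192
P, q, D_avg = 404, 4, 20      # |P| non-zero blocks, 4-bit messages, 20 iterations

# Encoder (Section 3.2)
enc_rate_factor = n - m                       # user data rate = 64 * f_E
enc_mem = (2 * n + m) * p + 3 * g             # bits
enc_gates = 3 * P + g * g // 2 + 8 * int(log2(p)) * P

# Decoder (Section 3.3)
dec_rate_factor = (n - m) / (2 * D_avg)       # user data rate = 1.6 * f_D
dec_mem = (n + P) * p * q
dec_gates = (320 * m + 250 * n) * q

print(enc_rate_factor, round(enc_mem / 1024), round(enc_gates / 1024))  # 64 21 38
print(dec_rate_factor, round(dec_mem / 1024), round(dec_gates / 1024))  # 1.6 133 205
```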

5. CONCLUSION

In this paper, we presented a joint code-encoder-decoder design approach for practical LDPC coding system hardware implementations. The basic idea is implementation-aware LDPC code design, which constructs an irregular LDPC code subject to two constraints that ensure effective LDPC encoder and decoder hardware implementations. A heuristic algorithm has been developed to perform the code construction while optimizing the error correction performance. The efficient encoding process was described and a pipelined encoder hardware architecture was developed. The decoder hardware architecture was also presented. The proposed approach for the first time provides a complete systematic solution for LDPC coding system hardware implementation.

6. REFERENCES

[1] M. M. Mansour and N. R. Shanbhag, “A novel design methodology for high-performance programmable decoder cores for AA-LDPC codes,” in IEEE Workshop on Signal Processing Systems (SiPS), Seoul, Korea, August 2003.

[2] D. E. Hocevar, “LDPC code construction with flexible hardware implementation,” in IEEE International Conference on Communications, 2003, pp. 2708–2712.

[3] Y. Chen and D. Hocevar, “An FPGA and ASIC implementation of rate 1/2 8088-b irregular low density parity check decoder,” in Proc. of Globecom, 2003.

[4] T. Zhang and K. K. Parhi, “Joint (3, k)-regular LDPC code and decoder/encoder design,” to appear in IEEE Transactions on Signal Processing, 2003.

[5] E. Yeo, B. Nikolic, and V. Anantharam, “Architectures and implementation of low-density parity-check decoding algorithms,” in 45th IEEE Midwest Symposium on Circuits and Systems, August 2002, pp. 437–440.

[6] T. Richardson and R. Urbanke, “Efficient encoding of low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 638–656, Feb. 2001.

[7] T. Richardson, A. Shokrollahi, and R. Urbanke, “Design of capacity-approaching low-density parity-check codes,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, Feb. 2001.
