The 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), May 21-24, 2006

Encoder Architecture with Throughput Over 10 Gbit/sec for Quasi-cyclic LDPC Codes

Zhiyong He, Sébastien Roy, and Paul Fortier

Department of Electrical and Computer Engineering, Laval University, Quebec City, Quebec, Canada, G1K 7P4

0-7803-9390-2/06/$20.00 ©2006 IEEE

Abstract-This paper discusses the design of a high-speed encoder for low-density parity-check (LDPC) codes. To minimize the hardware costs and memory requirements of such encoders, a class of high-performance quasi-cyclic LDPC codes that can be encoded in linear time is proposed by designing the parity check matrix in a triangular plus dual-diagonal form. Based on the proposed codes, parallel architectures and pipelining are used to increase the throughput of the encoders. Moreover, collisions that occur when parallel processors contend for write access to the same memory module are avoided by an iterative encoding approach that reuses the processors over several iterations. Implementation results on field-programmable gate array (FPGA) devices indicate that the encoder for the LDPC code with a block length of 2048 and a code rate of 0.5 attains a throughput of 12.8 Gbit/s using 352 exclusive-OR gates.

I. INTRODUCTION

Because of their capacity-achieving performance and the existence of effective decoding schemes, low-density parity-check (LDPC) codes introduced by Gallager [1] have recently received a lot of interest for reliable high-speed communication applications such as long-haul optical channels [2] and magnetic storage [3]. Several recently proposed standards suggest the incorporation of LDPC codes: Gigabit Ethernet [4], broadband wireless access networks [5], and Digital Video Broadcast (DVB) satellite communications [6]. For applications in long-haul optical channels, which support data rates of 20 Gbit/s, and in Gigabit Ethernet, which is expected to provide a data rate of up to 10 Gbit/s, high throughput of LDPC codes is a critical issue.

Based on parallelizable decoding algorithms, decoders for LDPC codes can attain a throughput of several Gbit/s [7]. However, the encoding problem becomes an obstacle for high-speed applications because the complexity of encoding is quadratic in the block length. By randomly constructing the parity check matrix in an approximate triangular form, Richardson et al. reduced the encoding complexity to $O(n + g^2)$, where $n$ is the block length and $g$ is the gap between the given parity check matrix and a triangular matrix [8]. For randomly constructed LDPC codes, a significant amount of memory is needed to store the parity-check matrix. To reduce memory requirements, several classes of structured LDPC codes have been proposed, such as LDPC codes based on quasi-cyclic (QC) codes [9, 10]. By employing the efficient encoding algorithms discussed in [8], the QC LDPC codes exhibit an encoding complexity made up of a term linear in $n$ and a term quadratic in $n$ whose coefficient depends on $R$ and $J$, where $R$, $n$ and $J$ denote the code rate, block length and column weight, respectively. For example, for a QC LDPC code with column weight 3 and code rate 0.5, the encoding scheme has a complexity of $O(n + n^2/6)$, which is still quadratic in the block length and therefore unacceptable for high-speed encoding.

In this paper, we present a special class of QC LDPC codes which have an encoding complexity of $O(n)$. By designing the parity check matrix in a triangular plus dual-diagonal form, the proposed LDPC codes can be encoded in linear time. The hardware implementation results indicate that the encoders for the proposed LDPC codes attain throughputs of more than 10 Gbit/s using only several hundred exclusive-OR (XOR) gates.

II. CODE CONSTRUCTION

This section discusses the code construction of LDPC codes with a column weight of 4. Extension of the proposed concept to LDPC codes with arbitrary column weight is straightforward.

A. Proposed LDPC codes

A $(4, L)$-regular structured LDPC code is defined as a code represented by the parity-check matrix $H$ shown in Fig. 1, in which each column has weight 4 and each row has weight $L$. In Fig. 1, the matrices $I$ and $0$ are the $q \times q$ identity matrix and null matrix, respectively, where $q$ is a positive integer. For $1 \le j \le 4$ and $1 \le l \le L$, the submatrix $H_{j,l} = A(C_{j,l})$ in position $(j, l)$ within $H$ is a circulant matrix obtained by cyclically shifting the rows of matrix $A$ to the right by $C_{j,l}$ places. Matrix $A$ is either $I$ or a permutation matrix of $I$. The coefficient $C_{j,l}$, where $1 \le C_{j,l} \le q$, is chosen randomly subject to the 4-cycle-free constraints. When matrix $A$ is the identity matrix $I$, the
proposed codes are a class of QC LDPC codes. The minimum Hamming distance $d_H$ of QC LDPC codes with column weight $J$ has an upper bound of $(J+1)!$ [11], e.g., $d_H \le 24$ when $J = 3$. To increase $d_H$, $A$ is replaced by a random permutation matrix of $I$ for $J = 3$.

$$
H = [\, H_S \;\; H_{P_1} \;\; H_{P_2} \;\; H_{P_3} \;\; H_{P_4} \,] =
\begin{bmatrix}
H_{1,1} \cdots H_{1,L-4} & I & 0 & 0 \\
H_{2,1} \cdots H_{2,L-4} & H_{2,L-3} & I & 0 \\
\begin{matrix} H_{3,1} \cdots H_{3,L-4} \\ H_{4,1} \cdots H_{4,L-4} \end{matrix} &
\begin{matrix} H_{3,L-3} \\ H_{4,L-3} \end{matrix} &
\begin{matrix} H_{3,L-2} \\ H_{4,L-2} \end{matrix} &
H_{P_{34}}
\end{bmatrix}
$$

Fig. 1. Proposed parity-check matrix. $H_S$ is comprised of a $4 \times (L-4)$ array of circulant submatrices. The last block column is $2q$ wide (it holds $H_{P_3}$ and $H_{P_4}$), and $H_{P_{34}}$ is the $2q \times 2q$ dual-diagonal submatrix at the lower-right corner.

Having the parity check matrix in a triangular form, the proposed LDPC codes are linearly encodable. To remove the columns of weight 1, which may cause an error floor, the submatrix at the lower-right corner of $H$ has a dual-diagonal form. On the other hand, matrices $H_{P_1}$ and $H_{P_2}$ within $H$ have column weights of 4 and 3, respectively. Thus, in $H$ for the proposed codes, $2q$, $q$ and $q(L-3)$ columns have weights 2, 3 and 4, respectively. In order to support a layered decoding algorithm, the dual-diagonal matrix at the lower-right corner of $H$ can be decomposed into four sub-matrices by row-column permutation. The proposed code has a code rate of $R = 1 - J/L$.

B. Performance analysis

To evaluate the proposed LDPC codes, we performed simulations assuming binary phase-shift keying (BPSK) modulation and an AWGN channel. The iterative belief-propagation (BP) algorithm was used for decoding. The BP decoder stops when either a valid codeword is found or 50 decoding iterations are reached.

The bit error rate (BER) and the frame error rate (FER) of the proposed code with a column weight of 4 versus the signal-to-noise ratio (SNR) per bit $E_b/N_0$ are compared in Fig. 2 with the LDPC code proposed by Intel Inc. [5], the so-called Intel code. The systematic parts of the parity check matrices for the two codes have a regular column weight of 4. Since the parity check matrix for the Intel code has a dual-diagonal form, all columns in its parity part have a uniform column weight of 2. Fig. 2 shows clearly that the proposed codes outperform the Intel code. For the proposed code shown in Fig. 2, $q = 100$ was chosen for each submatrix. The size $q$ can be chosen as a power of 2 for the benefit of simple hardware implementation, e.g., $q = 128$ or 256. Shown in Fig. 3 are the proposed codes with a column weight of 4 and different code rates. Since a uniform size $q = 256$ is chosen, the three codes with code rates 0.5, 0.6 and 2/3 have block lengths 2048, 2560 and 3072, respectively.

Fig. 2. Bit-error-rate and frame-error-rate comparisons between the Intel code and the proposed code. (Block length: 800; code rate: 0.5; horizontal axis: $E_b/N_0$; curves: FER Intel, FER Proposed, BER Intel, BER Proposed.)

Fig. 3. BER performances of LDPC codes with a block length $n$ and code rate $R$. (Horizontal axis: $E_b/N_0$; curves: $n = 2048$, $R = 0.5$; $n = 2560$, $R = 0.6$; $n = 3072$, $R = 2/3$.)

A. Encoding algorithm

Since LDPC codes are linear codes, $x$ is a codeword if and only if

$$ H x^T = 0. \tag{1} $$

In Fig. 4(a), $x$ is split into 5 parts, i.e. $x = (u, p_1, p_2, p_3, p_4)$, where $u$ denotes the systematic part, and $p_1$, $p_2$, $p_3$ and $p_4$ denote the parity parts. With $H_{Sj}$ denoting the $j$-th block row of $H_S$ and all additions taken modulo 2, the above equation splits naturally into 3 equations as follows:

$$ p_1^T = H_{S1}\, u^T, \tag{2} $$

$$ p_2^T = H_{S2}\, u^T + H_{2,L-3}\, p_1^T, \tag{3} $$

$$ H_{P_{34}} \begin{bmatrix} p_3 & p_4 \end{bmatrix}^T =
\begin{bmatrix} H_{S3} & H_{3,L-3} & H_{3,L-2} \\ H_{S4} & H_{4,L-3} & H_{4,L-2} \end{bmatrix}
\begin{bmatrix} u & p_1 & p_2 \end{bmatrix}^T. \tag{4} $$
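Because every block $H_{j,l}$ is a cyclically shifted identity matrix, equations (2) and (3) can be evaluated without storing any matrix: multiplying a length-$q$ bit vector by such a block is just a rotation of that vector. The following is a minimal sketch under stated assumptions, not the authors' implementation: it takes $A = I$, uses NumPy, draws the shifts at random without enforcing the 4-cycle-free constraints, and all function and variable names (`circulant`, `rotate`, `parity_p1_p2`, `c1`, `c2`, `c2_p1`) are hypothetical.

```python
import numpy as np

def circulant(q, shift):
    """q x q identity whose rows are cyclically shifted right by `shift` places."""
    return np.roll(np.eye(q, dtype=np.uint8), shift, axis=1)

def rotate(bits, shift):
    """Same result as circulant(q, shift) @ bits over GF(2), done as a rotation."""
    return np.roll(bits, -shift)

def parity_p1_p2(u_blocks, c1, c2, c2_p1):
    """Evaluate equations (2) and (3).

    u_blocks: list of L-4 length-q bit vectors (the systematic part u)
    c1, c2:   shifts of the blocks in block rows 1 and 2 of H_S (assumed values)
    c2_p1:    shift of the block H_{2,L-3} (assumed value)
    """
    p1 = np.zeros_like(u_blocks[0])
    p2 = np.zeros_like(u_blocks[0])
    for blk, s1, s2 in zip(u_blocks, c1, c2):
        p1 ^= rotate(blk, s1)          # accumulates H_S1 u^T
        p2 ^= rotate(blk, s2)          # accumulates H_S2 u^T
    p2 ^= rotate(p1, c2_p1)            # adds H_{2,L-3} p1^T
    return p1, p2

# Toy parameters (hypothetical): q = 8, L = 8, random shifts in 1..q.
rng = np.random.default_rng(0)
q, L = 8, 8
u_blocks = [rng.integers(0, 2, q, dtype=np.uint8) for _ in range(L - 4)]
c1 = rng.integers(1, q + 1, L - 4)
c2 = rng.integers(1, q + 1, L - 4)
c2_p1 = int(rng.integers(1, q + 1))
p1, p2 = parity_p1_p2(u_blocks, c1, c2, c2_p1)
```

The rotation form is what makes the memory footprint small: only the shift coefficients need to be stored, not the blocks themselves.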


Let us define an intermediate column vector

$$ v^T = [v(1), v(2), \ldots, v(4q)]^T = H_{SP} \begin{bmatrix} u & p_1 & p_2 \end{bmatrix}^T, \tag{5} $$

where

$$ H_{SP} = \begin{bmatrix}
H_{S1} & 0 & 0 \\
H_{S2} & H_{2,L-3} & 0 \\
H_{S3} & H_{3,L-3} & H_{3,L-2} \\
H_{S4} & H_{4,L-3} & H_{4,L-2}
\end{bmatrix}. \tag{6} $$

The resulting structure is shown in Fig. 4(a). Combining (2)-(4) with (5), the encoding equations are defined as follows:

$$ p_1(i) = v(i), \tag{7} $$

$$ p_2(i) = v(i + q), \tag{8} $$

$$ p_3(i) = v(i + 2q) + p_3(i - 1), \tag{9} $$

$$ p_4(i) = v(i + 3q) + p_4(i - 1), \tag{10} $$

where $1 \le i \le q$.

Fig. 4. (a) Encoding equation $H x^T = 0$. The codeword $x$ is split into 5 parts, i.e. $x = (u, p_1, p_2, p_3, p_4)$. (b) Block diagram of the encoder for the proposed LDPC code. (Blocks: matrix for $H_{SP}$ with XOR operation, matrix for $H_{P_{34}}$ with XOR operation; input $u$; outputs $p_1$, $p_2$, $p_3$, $p_4$.)

The block diagram of the encoding process is displayed in Fig. 4(b). First, the systematic bits vector $u$ is multiplied by $H_{SP}$ to obtain the parity bits $p_1$ and $p_2$. Then, the vector $(u, p_1, p_2)$ is multiplied by $H_{SP}$ to obtain the intermediate vector $v$. Since $H_{SP}$ is a sparse matrix with row weight $L-2$, the multiplication is simply implemented with $L-3$ XOR gates. Fig. 5(a) shows the processor with 10 inputs and one output used for the XOR processing of $H_{SP}$ having a row weight of 10. A tree architecture is employed to reduce the propagation time from input signal to output signal. Since $H_{P_{34}}$ is a dual-diagonal matrix, the XOR operations for $H_{P_{34}}$ are implemented by adding the intermediate bit $v$ and the bit $p(i-1)$. Fig. 5(b) shows the processor with 5 inputs and 4 outputs used for the XOR processing of $H_{P_{34}}$. Pipelining is exploited to perform this XOR step efficiently with several processors.

B. Parallel architectures and iterative technology

To increase the encoding speed, XOR processing for $H_{SP}$ can be performed in parallel by repeatedly using $M$ processors $t$ times, where $t = m/M$ and $m$ is the number of rows in $H_{SP}$. Then, $2q/(k \times t)$ processors with $k+1$ inputs and $k$ outputs are used for the XOR processing in $H_{P_{34}}$, where $2q$ is the number of rows in $H_{P_{34}}$.

As an intuitive example, consider a code with a block length of 128 and a code rate of 0.5. Matrices $H_{SP}$ and $H_{P_{34}}$ have 64 and 32 rows, respectively. When 16 processors are used for $H_{SP}$, $t = 4$ clock cycles are needed, i.e. rows 1 to 16 are processed in parallel in the first clock cycle, while rows 49 to 64 are processed in the 4th clock cycle. The XOR processing in $H_{P_{34}}$ can be performed by using 2 processors with 5 inputs and 4 outputs, i.e. rows 1 to 16 of $H_{P_{34}}$ are processed by the first processor in 4 clock cycles, and rows 17 to 32 are processed by the second processor.

The proposed parallel architectures, pipelining technology, and iterative encoding approach are applicable to other LDPC codes having an arbitrary parity check matrix. For example, consider an $m \times n$ matrix $H$ in a dual-diagonal form. Let $t$ be the number of iterations. Then $m/t$ processors can be used to perform XOR processing in parallel for the systematic part of $H$, while $m/(k \times t)$ processors with $k+1$ inputs and $k$ outputs are used to perform XOR processing in the parity part.
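The encoding equations (5)-(10) can be checked end to end on a toy code. The sketch below is an illustration under stated assumptions rather than the authors' design: it takes $A = I$, draws the shifts at random without enforcing the 4-cycle-free constraints, and assumes $p_3(0) = 0$ and $p_4(0) = p_3(q)$, i.e. that $H_{P_{34}}$ is a single $2q \times 2q$ matrix with ones on the main diagonal and the first subdiagonal (the initial conditions are not stated in the text). The names `circulant`, `build_matrices` and `encode` are hypothetical.

```python
import numpy as np

def circulant(q, shift):
    """q x q identity whose rows are cyclically shifted right by `shift` places."""
    return np.roll(np.eye(q, dtype=np.uint8), shift, axis=1)

def build_matrices(q, L, rng):
    """Return (H, H_SP) for random shifts; the 4-cycle check is omitted."""
    I = np.eye(q, dtype=np.uint8)
    Z = np.zeros((q, q), dtype=np.uint8)
    C = rng.integers(1, q + 1, size=(4, L - 2))      # shifts, 1 <= C <= q
    blk = lambda j, l: circulant(q, int(C[j, l]))
    sp_rows = []
    for j in range(4):                               # j = 0..3 is block row j+1
        row = [blk(j, l) for l in range(L - 4)]      # block row of H_S
        row.append(blk(j, L - 4) if j >= 1 else Z)   # block multiplying p1
        row.append(blk(j, L - 3) if j >= 2 else Z)   # block multiplying p2
        sp_rows.append(np.hstack(row))
    H_SP = np.vstack(sp_rows)                        # equation (6)
    k = q * (L - 4)
    E = np.zeros_like(H_SP)                          # identity blocks of H on p1, p2
    E[:q, k:k + q] = I
    E[q:2 * q, k + q:k + 2 * q] = I
    # Assumed form of H_P34: ones on the diagonal and first subdiagonal.
    D = np.eye(2 * q, dtype=np.uint8) + np.eye(2 * q, k=-1, dtype=np.uint8)
    right = np.vstack([np.zeros((2 * q, 2 * q), dtype=np.uint8), D])
    H = np.hstack([H_SP ^ E, right])
    return H, H_SP

def encode(u, H_SP, q):
    """Systematic encoding following equations (5) and (7)-(10)."""
    k = u.size
    p1 = H_SP[:q, :k] @ u % 2                                  # (7)
    p2 = H_SP[q:2 * q, :k + q] @ np.concatenate([u, p1]) % 2   # (8)
    v34 = H_SP[2 * q:, :] @ np.concatenate([u, p1, p2]) % 2    # v(2q+1 .. 4q)
    p34 = np.bitwise_xor.accumulate(v34)                       # (9), (10) as a running XOR
    return np.concatenate([u, p1, p2, p34])

rng = np.random.default_rng(1)
q, L = 8, 8                                   # toy sizes: n = qL = 64, R = 1 - 4/L = 0.5
H, H_SP = build_matrices(q, L, rng)
u = rng.integers(0, 2, q * (L - 4), dtype=np.uint8)
x = encode(u, H_SP, q)
syndrome = H @ x % 2                          # equation (1): all zeros for a codeword
```

The running XOR is the software counterpart of the dual-diagonal recursion: each parity bit depends only on one intermediate bit and the previous parity bit, which is why the step pipelines well.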

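The processor counts used in the parallel architecture ($t = m/M$ iterations for $H_{SP}$ and $2q/(k \times t)$ processors for $H_{P_{34}}$) can be reproduced with a few lines of arithmetic. This is a small sketch that assumes the divisions are exact, as they are in the block-length-128 example; the function names are hypothetical.

```python
def hsp_schedule(m, M):
    """Rows of H_SP (1-based, inclusive) handled in each of the t = m/M clock cycles."""
    if m % M:
        raise ValueError("this sketch assumes M divides m")
    t = m // M
    return [(c * M + 1, (c + 1) * M) for c in range(t)]

def parity_processors(rows_p34, k, t):
    """Number of (k+1)-input, k-output processors: rows_p34 / (k * t)."""
    if rows_p34 % (k * t):
        raise ValueError("this sketch assumes k*t divides the number of rows")
    return rows_p34 // (k * t)

def parity_rows(rows_p34, k, t):
    """Rows of H_P34 (1-based, inclusive) assigned to each parity processor."""
    per_proc = k * t                       # k rows per cycle, t cycles
    count = parity_processors(rows_p34, k, t)
    return [(p * per_proc + 1, (p + 1) * per_proc) for p in range(count)]

# Example from the text: block length 128, code rate 0.5, M = 16, k = 4.
cycles = hsp_schedule(64, 16)              # H_SP has 64 rows
t = len(cycles)
procs = parity_rows(32, 4, t)              # H_P34 has 32 rows
```

With these inputs `cycles` covers rows 1-16 in the first cycle and rows 49-64 in the fourth, and `procs` assigns rows 1-16 and 17-32 to two processors, matching the example in the text.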

IV. HARDWARE IMPLEMENTATION

A series of encoders for LDPC codes with various block lengths and code rates were implemented on Xilinx Virtex-II field-programmable gate array (FPGA) devices. The systematic bits $u$ were stored in multiple-port random access memory (RAM) with one port for writing and multiple ports for reading. For FPGA devices which only provide dual-port RAM, the bits $u$ were stored in four sets of RAM. The intermediate bits $v$ were stored in dual-port RAM with one port for writing and one port for reading. Each dual-port or multiple-port RAM is configured as a 16 x 1-bit module with 16 addresses, so that $M$ modules can store $16M$ bits. For an LDPC code with a block length of $n$, a total of $n$ bits were needed, divided up into $n/16$ modules.

With the multi-port configuration, several processors can simultaneously read data from the same RAM module in each clock cycle, but only one processor can write data into a given module. To avoid write access collisions, the number of iterations is chosen as 16. Thus, the bit $v$ produced by the $j$-th processor at the $i$-th iteration is written into the $i$-th address of the $j$-th RAM module. Since each submatrix within $H$ is a circulant matrix obtained by cyclically shifting the rows of a $q \times q$ identity matrix $I$ to the right, the addresses for reading the systematic bits from the RAM modules were generated easily by several $N$-bit accumulators, where $q \le 2^N$. The accumulators were initialized with the shifting coefficients $C_{j,l}$, which were stored in read-only-memory modules.

The XOR utilization statistics and the throughputs of a series of encoders for LDPC codes are listed in Table I. The encoding throughputs were determined using an encoding clock frequency $F = 100$ MHz and 16 encoding iterations. For the LDPC code with a block length of 2048 and a code rate of 0.5, the encoder attains a throughput of 12.8 Gbit/s using 352 XOR gates.

TABLE I. XOR GATE UTILIZATION AND THROUGHPUTS OF SEVERAL ENCODERS FOR THE PROPOSED LDPC CODES.

q     Block length   Code rate   XOR gates   Throughput (Gbit/s)
128   2048           3/4         432         12.8
128   2560           4/5         560         16.0
128   3072           5/6         688         19.2
256   2048           1/2         352         12.8
256   2560           3/5         472         16.0
256   3072           2/3         608         19.2

V. CONCLUSION

This paper has presented an encoder for a class of high-performance QC LDPC codes having a parity check matrix in a triangular plus dual-diagonal form. The proposed LDPC codes are linearly encodable and have better memory efficiency. Parallel encoding and pipelining structures were exploited to increase encoding throughput. To avoid memory access collisions when parallel processors contend for access to the same module, an iterative encoding approach was proposed. The advantages in terms of hardware savings and encoding throughput for the proposed LDPC codes have been characterized by implementing a series of encoders on a Xilinx FPGA device. For an LDPC code with a block length of several thousand bits, the encoder attains a throughput of more than 10 Gbit/s using only several hundred XOR gates. This demonstrates that the proposed LDPC codes are suitable for high-speed applications in the Gbit/s region, such as long-haul optical channels and Gigabit Ethernet.

ACKNOWLEDGMENT

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Le Fonds québécois de la recherche sur la nature et les technologies (FQRNT). The support of the Canadian Microelectronics Corporation (CMC), under its System-On-Chip Research Network (SOCRN) program, is also acknowledged.

REFERENCES

[1] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.
[2] B. Vasic and I. B. Djordjevic, "Low density parity check codes for long-haul optical communication systems," IEEE Photonics Technology Lett., vol. 14, pp. 1208-1210, 2002.
[3] A. Dholakia, E. Eleftheriou, T. Mittelholzer, and M. P. C. Fossorier, "Capacity-approaching codes: Can they be applied to the magnetic recording channel?" IEEE Communications Magazine, vol. 42, no. 2, pp. 122-130, Feb. 2004.
[4] IEEE 802.3 10GBase-T Study Group Meeting, World Wide Web, http://www.ieee802.org/3/10GBT/public/jul04/rao-1-0704.pdf, July 2004.
[5] E. Jacobsen, "Draft text for LDPC coding scheme for OFDMA," IEEE 802.16 Broadband Wireless Access Working Group, IEEE C802.16e-04/96, May 12, 2004.
[6] European Telecommunication Standards Institute, World Wide Web, http://www.dvb.org/documents/white-papers/wp06.DVB-S2.final.pdf.
[7] A. Darabiha, A. C. Carusone, and F. R. Kschischang, "Multi-Gbit/sec low density parity check decoders with reduced interconnect complexity," IEEE International Symposium on Circuits and Systems, pp. 5194-5197, May 2005.
[8] T. J. Richardson and R. L. Urbanke, "Efficient encoding of low-density parity-check codes," IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 638-656, Feb. 2001.
[9] M. P. C. Fossorier, "Quasi-cyclic low-density parity-check codes from circulant permutation matrices," IEEE Trans. on Information Theory, vol. 50, pp. 1788-1793, Aug. 2004.
[10] R. M. Tanner, D. Sridhara, A. Sridharan, T. E. Fuja, and D. J. Costello, Jr., "LDPC block and convolutional codes based on circulant matrices," IEEE Trans. on Information Theory, vol. 50, pp. 2966-2984, Dec. 2004.
[11] D. J. C. MacKay and M. Davey, "Evaluation of Gallager codes for short block length and high rate applications," in Proc. IMA Workshop Codes, Systems and Graphical Models, 1999.
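As a cross-check of Table I, the code rates follow from $R = 1 - J/L$ with $J = 4$ and $L = n/q$, and the throughput column is consistent with one $n$-bit codeword being produced every 16 clock cycles at $F = 100$ MHz. The second relation, throughput $= nF/t$, is an assumption inferred from the reported numbers rather than a formula stated in the text; the names below are hypothetical.

```python
from fractions import Fraction

def code_rate(n, q, J=4):
    """R = 1 - J/L with L = n/q block columns (J = 4 for the proposed codes)."""
    L = n // q
    return 1 - Fraction(J, L)

def throughput_bps(n, clock_hz, iterations):
    """Assumed model: one n-bit codeword every `iterations` clock cycles."""
    return n * clock_hz // iterations

# (q, block length, code rate, throughput in Mbit/s) as listed in Table I.
TABLE_I = [
    (128, 2048, Fraction(3, 4), 12800),
    (128, 2560, Fraction(4, 5), 16000),
    (128, 3072, Fraction(5, 6), 19200),
    (256, 2048, Fraction(1, 2), 12800),
    (256, 2560, Fraction(3, 5), 16000),
    (256, 3072, Fraction(2, 3), 19200),
]

F_HZ, ITERATIONS = 100_000_000, 16
rows_consistent = all(
    code_rate(n, q) == rate
    and throughput_bps(n, F_HZ, ITERATIONS) == mbps * 1_000_000
    for q, n, rate, mbps in TABLE_I
)
```

Under this model the throughput depends only on the block length, the clock frequency and the number of iterations, which agrees with the table showing equal throughput for equal block lengths at both values of $q$.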
