The 2006 IEEE International Symposium on Circuits and Systems (ISCAS 2006), May 21-24, 2006
Encoder Architecture with Throughput Over 10 Gbit/sec for Quasi-cyclic LDPC Codes
Zhiyong He, Sébastien Roy, and Paul Fortier
Department of Electrical and Computer Engineering, Laval University, Quebec City, Quebec, Canada, G1K 7P4
0-7803-9390-2/06/$20.00 ©2006 IEEE

Abstract - This paper discusses the design of a high-speed encoder for low-density parity-check (LDPC) codes. To minimize the hardware costs and memory requirements of such encoders, a class of high-performance quasi-cyclic LDPC codes which can be encoded in linear time is proposed by designing the parity check matrix in a triangular plus dual-diagonal form. Based on the proposed codes, parallel architectures and pipelining are used to increase the throughput of the encoders. Moreover, collisions which occur when parallel processors contend for write access to the same memory module are avoided by an iterative encoding approach which involves repeated usage of the processors. Implementation results on field-programmable gate array (FPGA) devices indicate that the encoder for the LDPC code with a block length of 2048 and a code rate of 0.5 attains a throughput of 12.8 Gbit/s using 352 exclusive-OR gates.

I. INTRODUCTION

Because of their capacity-achieving performance and the existence of effective decoding schemes, low-density parity-check (LDPC) codes introduced by Gallager [1] have recently received a lot of interest for reliable high-speed communication applications such as long-haul optical channels [2] and magnetic storage [3]. Several recently proposed standards suggest the incorporation of LDPC codes: Gigabit Ethernet [4], broadband wireless access networks [5], and Digital Video Broadcast (DVB) satellite communications [6]. For applications in long-haul optical channels, which support data rates of 20 Gbit/s, and in Gigabit Ethernet, which is expected to provide a data rate of up to 10 Gbit/s, the throughput of LDPC codes is a critical issue.

Based on parallelizable decoding algorithms, decoders for LDPC codes can attain a throughput of several Gbit/s [7]. However, the encoding problem becomes an obstacle for high-speed applications because the complexity of encoding is quadratic in the block length. By randomly constructing the parity check matrix in an approximate triangular form, Richardson et al. reduced the encoding complexity to $O(n + g^2)$, where $n$ is the block length and $g$ is the gap between the given parity check matrix and a triangular matrix [8]. For randomly constructed LDPC codes, a significant amount of memory is needed to store the parity-check matrix. To reduce memory requirements, several classes of structured LDPC codes have been proposed, such as LDPC codes based on quasi-cyclic (QC) codes [9, 10]. By employing the efficient encoding algorithms discussed in [8], the QC LDPC codes exhibit an encoding complexity that still contains a term quadratic in $n$, with a coefficient determined by $R$ and $J$, where $R$, $n$ and $J$ denote code rate, block length and column weight, respectively. For example, for a QC LDPC code with column weight 3 and code rate 0.5, the encoding scheme has a complexity of $O(n + n^2/6)$, which is still unacceptable.

In this paper, we present a special class of QC LDPC codes which have an encoding complexity of $O(n)$. By designing the parity check matrix in a triangular plus dual-diagonal form, the proposed LDPC codes can be encoded in linear time. The hardware implementation results indicate that the encoders for the proposed LDPC codes attain throughputs of more than 10 Gbit/s using only several hundred exclusive-OR (XOR) gates.

II. CODE CONSTRUCTION

This section discusses the code construction of LDPC codes with a column weight of 4. Extension of the proposed concept to LDPC codes with arbitrary column weight is straightforward.

A. Proposed LDPC codes

A $(4, L)$-regular structured LDPC code is defined as a code represented by the parity-check matrix $H$ shown in Fig. 1, in which each column has weight 4 and each row has weight $L$. In Fig. 1, the matrices $I$ and $0$ are the $q \times q$ identity matrix and null matrix, respectively, where $q$ is a positive integer. For $1 \le j \le 4$ and $1 \le l \le L$, the submatrix $H_{j,l} = A(C_{j,l})$ in position $(j, l)$ within $H$ is a circulant matrix obtained by cyclically shifting the rows of matrix $A$ to the right by $C_{j,l}$ places. Matrix $A$ is either $I$ or a permutation matrix of $I$. The coefficient $C_{j,l}$, where $1 \le C_{j,l} \le q$, is chosen randomly subject to the 4-cycle-free constraints. When matrix $A$ is the identity matrix $I$, the
proposed codes are a class of QC LDPC codes. The minimum Hamming distance $d_H$ of QC LDPC codes with column weight $J$ has an upper bound of $(J+1)!$ [11], e.g., $d_H \le 24$ when $J = 3$. To increase $d_H$, $A$ is replaced by a random permutation matrix of $I$ for $J = 3$.

$$H = [\,H_S \;\; H_{P1} \;\; H_{P2} \;\; H_{P3} \;\; H_{P4}\,] = \begin{bmatrix} H_{1,1} & \cdots & H_{1,L-4} & I & 0 & 0 & 0 \\ H_{2,1} & \cdots & H_{2,L-4} & H_{2,L-3} & I & 0 & 0 \\ H_{3,1} & \cdots & H_{3,L-4} & H_{3,L-3} & H_{3,L-2} & \ast & \ast \\ H_{4,1} & \cdots & H_{4,L-4} & H_{4,L-3} & H_{4,L-2} & \ast & \ast \end{bmatrix}$$

Fig. 1. Proposed parity-check matrix. $H_S$ is comprised of a $4 \times (L-4)$ array of circulant submatrices. The entries marked $\ast$ form the $2q \times 2q$ dual-diagonal block at the lower-right corner, denoted $H_{P34}$.

With the parity check matrix in a triangular form, the proposed LDPC codes are linearly encodable. To remove the columns of weight 1, which may cause an error floor, the submatrix at the lower-right corner of $H$ has a dual-diagonal form. On the other hand, matrices $H_{P1}$ and $H_{P2}$ within $H$ have column weights of 4 and 3, respectively. Thus, in $H$ for the proposed codes, $2q$, $q$ and $q(L-3)$ columns have weights 2, 3 and 4, respectively. In order to support a layered decoding algorithm, the dual-diagonal matrix at the lower-right corner of $H$ can be decomposed into four sub-matrices by row-column permutation. The proposed code has a code rate of $R = 1 - J/L$.

B. Performance analysis

To evaluate the proposed LDPC codes, we performed simulations assuming binary phase-shift keying (BPSK) modulation and an AWGN channel. The iterative belief-propagation (BP) algorithm was used for decoding. The BP decoder stops when either a valid codeword is found or 50 decoding iterations are reached.

Fig. 2. Bit-error-rate and frame-error-rate comparisons between the Intel code and the proposed code (block length 800, code rate 0.5; BER and FER versus $E_b/N_0$).

The bit error rate (BER) and the frame error rate (FER) of the proposed code with a column weight of 4 versus the signal-to-noise ratio (SNR) per bit $E_b/N_0$ are compared in Fig. 2 with those of the LDPC code proposed by Intel Inc. [5], the so-called Intel code. The systematic parts of the parity check matrices for the two codes have a regular column weight of 4. Since the parity check matrix for the Intel code has a dual-diagonal form, all columns in the parity part have a uniform column weight of 2. Fig. 2 shows clearly that the proposed codes dramatically outperform the Intel code. For the proposed code shown in Fig. 2, $q = 100$ was chosen for each submatrix. The size $q$ can be chosen as a power of 2 for the benefit of simple hardware implementation, e.g., $q = 128$ or 256. Fig. 3 shows the proposed codes with a column weight of 4 and different code rates. Since a uniform size $q = 256$ is chosen, the three codes with code rates 0.5, 0.6 and 2/3 have block lengths 2048, 2560 and 3072, respectively.

Fig. 3. BER performances of LDPC codes with a block length $n$ and code rate $R$ (curves for $n = 2048$, $R = 0.5$; $n = 2560$, $R = 0.6$; $n = 3072$, $R = 2/3$; BER versus $E_b/N_0$).

A. Encoding algorithm

Since LDPC codes are linear codes, $x$ is a codeword if and only if

$$H x^T = 0. \tag{1}$$

In Fig. 4(a), $x$ is split into 5 parts, i.e. $x = (u, p_1, p_2, p_3, p_4)$, where $u$ denotes the systematic part, and $p_1$, $p_2$, $p_3$, and $p_4$ denote the parity parts. With $H_{Sj} = [H_{j,1} \cdots H_{j,L-4}]$ denoting the $j$-th block row of $H_S$ and $H_{P34}$ the dual-diagonal block at the lower-right corner of $H$, the above equation splits naturally into 3 equations as follows:

$$p_1^T = H_{S1} u^T \tag{2}$$

$$p_2^T = H_{S2} u^T + H_{2,L-3}\, p_1^T \tag{3}$$

$$H_{P34} \begin{bmatrix} p_3^T \\ p_4^T \end{bmatrix} = \begin{bmatrix} H_{S3} & H_{3,L-3} & H_{3,L-2} \\ H_{S4} & H_{4,L-3} & H_{4,L-2} \end{bmatrix} \begin{bmatrix} u^T \\ p_1^T \\ p_2^T \end{bmatrix} \tag{4}$$
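The split of (1) into (2)-(4) follows from writing $H x^T = 0$ one block row at a time. The lines below sketch that step, assuming the block layout of Fig. 1 with $H_{Sj} = [H_{j,1} \cdots H_{j,L-4}]$ and with $H_{P34}$ standing for the $2q \times 2q$ dual-diagonal block at the lower-right corner; all additions are modulo 2.

```latex
\begin{align*}
\text{block row 1:}\quad
  & H_{S1}u^T + p_1^T = 0
  && \Rightarrow\; p_1^T = H_{S1}u^T, \\
\text{block row 2:}\quad
  & H_{S2}u^T + H_{2,L-3}\,p_1^T + p_2^T = 0
  && \Rightarrow\; p_2^T = H_{S2}u^T + H_{2,L-3}\,p_1^T, \\
\text{block rows 3, 4:}\quad
  & \begin{bmatrix} H_{S3} & H_{3,L-3} & H_{3,L-2} \\
                    H_{S4} & H_{4,L-3} & H_{4,L-2} \end{bmatrix}
    \begin{bmatrix} u^T \\ p_1^T \\ p_2^T \end{bmatrix}
    + H_{P34}\begin{bmatrix} p_3^T \\ p_4^T \end{bmatrix} = 0 .
\end{align*}
```

Because addition and subtraction coincide over GF(2), moving a term across the equals sign leaves its sign unchanged, which gives the forms used in (2)-(4).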
Let us define an intermediate column vector

$$v^T = [v(1), v(2), \ldots, v(4q)]^T = H_{SP}\,[u \;\; p_1 \;\; p_2]^T, \tag{5}$$

where

$$H_{SP} = \begin{bmatrix} H_{S1} & 0 & 0 \\ H_{S2} & H_{2,L-3} & 0 \\ H_{S3} & H_{3,L-3} & H_{3,L-2} \\ H_{S4} & H_{4,L-3} & H_{4,L-2} \end{bmatrix}. \tag{6}$$

The resulting structure is shown in Fig. 4(a). Combining (2)-(4) with (5), the encoding equations are defined as follows:

$$p_1(i) = v(i), \tag{7}$$

$$p_2(i) = v(i+q), \tag{8}$$

$$p_3(i) = v(i+2q) + p_3(i-1), \tag{9}$$

$$p_4(i) = v(i+3q) + p_4(i-1), \tag{10}$$

where $1 \le i \le q$.

Fig. 4. (a) Encoding equation $H x^T = 0$. The codeword $x$ is split into 5 parts, i.e. $x = (u, p_1, p_2, p_3, p_4)$. (b) Block diagram of the encoder for the proposed LDPC code.

The block diagram of the encoding process is displayed in Fig. 4(b). First, the systematic bit vector $u$ is multiplied by $H_{SP}$ to obtain the parity bits $p_1$ and $p_2$. Then, the vector $(u, p_1, p_2)$ is multiplied by $H_{SP}$ to obtain the intermediate vector $v$. Since $H_{SP}$ is a sparse matrix with row weight $L-2$, the multiplication is simply implemented with $L-3$ XOR gates per row (an XOR of $L-2$ inputs needs $L-3$ two-input gates). Fig. 5(a) shows the processor with 10 inputs and one output used for the XOR processing of $H_{SP}$ having a row weight of 10. A tree architecture is employed to reduce the propagation time from input signal to output signal. Since $H_{P34}$ is a dual-diagonal matrix, the XOR operations for $H_{P34}$ are implemented by adding the intermediate bit $v$ and the previous parity bit $p(i-1)$. Fig. 5(b) shows the processor with 5 inputs and 4 outputs used for the XOR processing of $H_{P34}$. Pipelining is exploited to perform this XOR step efficiently with several processors.

B. Parallel architectures and iterative technology

To increase the encoding speed, XOR processing for $H_{SP}$ can be performed in parallel by using $M$ processors repeatedly for $t$ times, where $t = m/M$ and $m$ is the number of rows in $H_{SP}$. Then, $2q/(k \times t)$ processors with $k+1$ inputs and $k$ outputs are used for the XOR processing in $H_{P34}$, where $2q$ is the number of rows in $H_{P34}$.

As an intuitive example, consider a code with a block length of 128 and a code rate of 0.5. Matrices $H_{SP}$ and $H_{P34}$ have 64 and 32 rows, respectively. When 16 processors are used for $H_{SP}$, $t = 4$ clock cycles are needed, i.e. rows 1 to 16 are processed in parallel at the first clock cycle, while rows 49 to 64 are processed at the 4-th clock cycle. The XOR processing in $H_{P34}$ can be performed by using 2 processors with 5 inputs and 4 outputs, i.e. rows 1 to 16 in $H_{P34}$ are processed by the first processor in 4 clock cycles, and rows 17 to 32 are processed by the second processor.

The proposed parallel architectures, pipelining, and iterative encoding approach are applicable to other LDPC codes having an arbitrary parity check matrix. For example, consider an $m \times n$ matrix $H$ in a dual-diagonal form. Let $t$ be the number of iterations. Then $m/t$ processors can be used to perform XOR processing in parallel for the systematic part of $H$, while $m/(k \times t)$ processors with $(k+1)$ inputs and $k$ outputs are used to perform XOR processing in the parity part.
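The encoding equations (5)-(10) can be exercised in software with a minimal sketch such as the one below. It is not the authors' hardware design. The function names, the random shift values (drawn without any 4-cycle check) and the use of NumPy are illustrative choices. The exact lower-right block of Fig. 1 is not recoverable here, so the sketch assumes two independent $q \times q$ dual-diagonal blocks with $p_3(0) = p_4(0) = 0$, which is the simplest layout matching (9) and (10).

```python
import numpy as np

def circulant(q, shift):
    """q x q identity whose rows are cyclically shifted right by `shift` places."""
    return np.roll(np.eye(q, dtype=np.int64), shift, axis=1)

def build_hsp(q, L, rng):
    """Assemble H_SP of (6) from random shifts (no 4-cycle check: a simplification)."""
    Z = np.zeros((q, q), dtype=np.int64)
    rows = []
    for j in range(4):
        sys_part = [circulant(q, int(rng.integers(q))) for _ in range(L - 4)]
        col_p1 = circulant(q, int(rng.integers(q))) if j >= 1 else Z  # H_{j,L-3}
        col_p2 = circulant(q, int(rng.integers(q))) if j >= 2 else Z  # H_{j,L-2}
        rows.append(sys_part + [col_p1, col_p2])
    return np.block(rows)

def encode(u, hsp, q):
    """Compute x = (u, p1, p2, p3, p4) following (5)-(10)."""
    ks = u.size                                    # (L-4)*q systematic bits
    p1 = (hsp[:q, :ks] @ u) % 2                    # (7)
    p2 = (hsp[q:2*q, :ks] @ u + hsp[q:2*q, ks:ks+q] @ p1) % 2   # (8)
    v = (hsp @ np.concatenate([u, p1, p2])) % 2    # (5)
    p3 = np.cumsum(v[2*q:3*q]) % 2                 # (9), ASSUMING p3(0) = 0
    p4 = np.cumsum(v[3*q:4*q]) % 2                 # (10), ASSUMING p4(0) = 0
    return np.concatenate([u, p1, p2, p3, p4])

def full_parity_check(hsp, q):
    """H under the ASSUMED layout: identities for p1, p2 and two dual-diagonal blocks."""
    ks = hsp.shape[1] - 2 * q
    I = np.eye(q, dtype=np.int64)
    D = I + np.eye(q, k=-1, dtype=np.int64)        # dual-diagonal q x q block
    Z = np.zeros((q, q), dtype=np.int64)
    parity = np.block([[I, Z, Z, Z],
                       [Z, I, Z, Z],
                       [Z, Z, D, Z],
                       [Z, Z, Z, D]])
    h = np.hstack([hsp, np.zeros((4 * q, 2 * q), dtype=np.int64)])
    h[:, ks:] = (h[:, ks:] + parity) % 2
    return h

rng = np.random.default_rng(0)
q, L = 16, 8                                       # block length L*q = 128, rate 0.5
hsp = build_hsp(q, L, rng)
u = rng.integers(0, 2, size=(L - 4) * q)
x = encode(u, hsp, q)
# Every parity check of the assumed H is satisfied by the encoded word.
assert not ((full_parity_check(hsp, q) @ x) % 2).any()
```

The running sums in `encode` are the software counterpart of adding each intermediate bit to the previous parity bit. The cost of the whole procedure grows linearly with the number of ones in $H_{SP}$, which is the linear-time property the paper relies on.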
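The processor counts in the block-length-128 example can be reproduced with a few lines of arithmetic. The sketch below is only an illustration for the column-weight-4 case. The function name and the dictionary keys are hypothetical, and it relies on $R = 1 - 4/L$, $n = Lq$, $t = m/M$ and $2q/(k \times t)$ as given in the text.

```python
from fractions import Fraction

def encoder_schedule(n, rate, M, k):
    """Iteration count and processor allocation for a column-weight-4 code."""
    L = 4 / (1 - Fraction(rate))          # from R = 1 - 4/L
    q = Fraction(n) / L                   # block length n = L * q
    if L.denominator != 1 or q.denominator != 1:
        raise ValueError("n and rate do not give integer L and q")
    q = int(q)
    m = 4 * q                             # number of rows of H_SP
    if m % M:
        raise ValueError("M must divide the number of rows of H_SP")
    t = m // M                            # iterations (clock cycles) for H_SP
    if (2 * q) % (k * t):
        raise ValueError("k * t must divide the 2q rows of H_P34")
    # Rows of H_SP handled in each clock cycle, 1-indexed and inclusive.
    batches = [(c * M + 1, (c + 1) * M) for c in range(t)]
    return {"q": q, "t": t, "hsp_batches": batches,
            "hp34_processors": 2 * q // (k * t)}

s = encoder_schedule(128, Fraction(1, 2), M=16, k=4)
print(s["t"], s["hsp_batches"], s["hp34_processors"])
```

For the example in the text this gives 4 clock cycles, with rows 1 to 16 in the first cycle and rows 49 to 64 in the fourth, and 2 processors for $H_{P34}$.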
IV. HARDWARE IMPLEMENTATION

A series of encoders for LDPC codes with various block lengths and various code rates were implemented in Xilinx Virtex-II Field Programmable Gate Array (FPGA) devices. The systematic bits $u$ were stored in multiple-port Random Access Memory (RAM) with one port for writing and multiple ports for reading. For FPGA devices which only provide dual-port RAM, the bits $u$ were stored in four sets of RAM. The intermediate bits $v$ were stored in dual-port RAM with one port for writing and one port for reading. Each dual-port or multiple-port RAM is configured as a 16 x 1-bit module with 16 addresses, so that $M$ modules can store $16M$ bits. For an LDPC code with a block length of $n$, a total of $n$ bits were needed, divided up into $n/16$ modules.

With the multi-port configuration, several processors can simultaneously read data from the same RAM module in each clock cycle, but only one processor can write data into a given module. To avoid write access collisions, the number of iterations is chosen as 16. Thus, the bit $v$ of the $j$-th processor at the $i$-th iteration is written into the $i$-th address of the $j$-th RAM module. Since each submatrix within $H$ is a circulant matrix obtained by cyclically shifting the rows of a $q \times q$ identity matrix $I$ to the right, the addresses for reading the systematic bits from the RAM modules were generated easily by several $N$-bit accumulators, where $q \le 2^N$. The accumulators were initialized with the shifting coefficients $C_{j,l}$, which were stored in read-only-memory modules.

The XOR utilization statistics and the throughput of a series of encoders for LDPC codes are listed in Table I. The encoding throughputs were determined using an encoding clock frequency $F = 100$ MHz and 16 encoding iterations. For the LDPC code with a block length of 2048 and a code rate of 0.5, the encoder attains a throughput of 12.8 Gbit/s using 352 XOR gates.

TABLE I. XOR GATE UTILIZATION AND THROUGHPUTS OF SEVERAL ENCODERS FOR THE PROPOSED LDPC CODES.

q     Block length   Code rate   XOR gates   Throughput (Gbit/s)
128   2048           3/4         432         12.8
128   2560           4/5         560         16.0
128   3072           5/6         688         19.2
256   2048           1/2         352         12.8
256   2560           3/5         472         16.0
256   3072           2/3         608         19.2

V. CONCLUSION

This paper presented an encoder for a class of high-performance QC LDPC codes having a parity check matrix in a triangular plus dual-diagonal form. The proposed LDPC codes are linearly encodable and have better memory efficiency. Parallel encoding and pipelining structures were exploited to increase encoding throughput. To avoid memory access collisions when parallel processors contend for access to the same module, an iterative encoding approach was proposed. The advantages in terms of hardware savings and encoding throughput for the proposed LDPC codes have been characterized by implementing a series of encoders in a Xilinx FPGA device. For an LDPC code with a block length of several thousand bits, the encoder attains a throughput of more than 10 Gbit/s using only several hundred XOR gates. This demonstrates that the proposed LDPC codes are suitable for high-speed applications in the Gbit/s region, such as long-haul optical channels and Gigabit Ethernet.

ACKNOWLEDGMENT

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and Le Fonds québécois de la recherche sur la nature et les technologies (FQRNT). The support of the Canadian Microelectronics Corporation (CMC), under its System-On-Chip Research Network (SOCRN) program, is also acknowledged.

REFERENCES

[1] R. G. Gallager, Low-Density Parity-Check Codes. Cambridge, MA: MIT Press, 1963.
[2] B. Vasic and I. B. Djordjevic, "Low density parity check codes for long-haul optical communication systems," IEEE Photonics Technology Lett., vol. 14, pp. 1208-1210, 2002.
[3] A. Dholakia, E. Eleftheriou, M. P. C. Fossorier, and T. Mittelholzer, "Capacity-approaching codes: Can they be applied to the magnetic recording channel?" IEEE Communications Magazine, vol. 42, no. 2, pp. 122-130, Feb. 2004.
[4] IEEE 802.3 10GBase-T Study Group Meeting, World Wide Web, http://www.ieee802.org/3/10GBT/public/jul04/rao-1-0704.pdf, July 2004.
[5] E. Jacobsen, "Draft text for LDPC coding scheme for OFDMA," IEEE 802.16 Broadband Wireless Access Working Group, IEEE C802.16e-04/96, May 12, 2004.
[6] European Telecommunication Standards Institute, World Wide Web, http://www.dvb.org/documents/white-papers/wp06.DVB-S2.final.pdf.
[7] A. Darabiha, A. C. Carusone, and F. R. Kschischang, "Multi-Gbit/sec low density parity check decoders with reduced interconnect complexity," IEEE International Symposium on Circuits and Systems, pp. 5194-5197, May 2005.
[8] T. J. Richardson and R. L. Urbanke, "Efficient encoding of low-density parity-check codes," IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 638-656, Feb. 2001.
[9] M. P. C. Fossorier, "Quasi-cyclic low-density parity-check codes from circulant permutation matrices," IEEE Trans. on Information Theory, vol. 50, pp. 1788-1793, Aug. 2004.
[10] R. M. Tanner, D. Sridhara, A. Sridharan, T. E. Fuja, and D. J. Costello, Jr., "LDPC block and convolutional codes based on circulant matrices," IEEE Trans. on Information Theory, vol. 50, pp. 2966-2984, Dec. 2004.
[11] D. J. C. MacKay and M. Davey, "Evaluation of Gallager codes for short block length and high rate applications," in Proc. IMA Workshop Codes, Systems and Graphical Models, 1999.
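As a closing note on Table I, the block lengths, code rates and throughputs listed there are consistent with $n = Lq$, $R = 1 - 4/L$ and a throughput of $n F / 16$, i.e. one $n$-bit codeword every 16 clock cycles at $F = 100$ MHz. The throughput formula is inferred from the tabulated values and is not stated explicitly in the paper; the function name below is hypothetical, and the XOR gate counts are not modelled.

```python
from fractions import Fraction

def table_row(q, L, clock_hz=100e6, iterations=16):
    """Block length, code rate and throughput in Gbit/s for a column-weight-4 code."""
    n = q * L                              # block length
    rate = 1 - Fraction(4, L)              # R = 1 - J/L with J = 4
    # ASSUMED model: n codeword bits leave the encoder every `iterations` cycles.
    throughput_gbps = n * clock_hz / iterations / 1e9
    return n, rate, throughput_gbps

for q, L in [(128, 16), (128, 20), (128, 24), (256, 8), (256, 10), (256, 12)]:
    n, rate, gbps = table_row(q, L)
    print(q, n, rate, round(gbps, 1))
```

Under this model the throughput depends only on the block length, which matches the table: both $q = 128$ and $q = 256$ give 12.8 Gbit/s at $n = 2048$.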