
國立臺灣大學電機資訊學院資訊工程學系 博士論文

Department of Computer Science and Information Engineering College of Electrical Engineering and Computer Science

National Taiwan University Doctoral Dissertation

低密度同位檢查碼系統設計

Design of Low-Density Parity-Check Coding Systems

林家瑜 Chia-Yu Lin

指導教授﹕顧孟愷 博士 Advisor: Mong-Kai Ku, Ph.D.

中華民國 100 年 3 月 March, 2011



誌謝

The completion of this dissertation is owed first to my advisor, Professor Mong-Kai Ku. Under his guidance over these years I have grown in many ways, and his teaching and advice have always been inspiring and of great benefit to me. I thank my oral defense committee members, Professor Fei-Pei Lai, Professor An-Yeu Wu, Professor Chia-Lin Yang, Professor Shih-Hao Hung, and Professor 廖俊睿, for suggesting many directions for improvement so that this dissertation could take a better form. I thank the many senior students, classmates, and junior students in the lab, with whom I shared many precious moments. I thank my good friend 宇翔 for sacrificing his rest to help so much during the preparation of the oral defense; I will always keep it in mind. I thank my girlfriend 劉暢 for her constant reminders and encouragement and for her many suggestions on the writing of this dissertation. I thank all the friends and classmates who have helped, cared about, and encouraged me. Finally, I thank my parents and family for their unconditional support during this period, which allowed me to complete my studies without worry.


摘要

Low-density parity-check (LDPC) codes achieve error correction performance close to the Shannon theoretical limit. However, compared with other error correcting codes, encoding and decoding LDPC codes usually require more power and processing time, which limits their practical applications. In this dissertation, we propose the following techniques for designing efficient LDPC coding systems: (1) a low-latency encoding method; (2) several decoding techniques, such as node operation reduction, intelligent scheduling, and early stopping criteria; and (3) code structures with implementation considerations.

First, we propose an efficient encoding algorithm for the dual-diagonal LDPC codes adopted by many recent communication standards. The proposed scheme uses two-way parity bit correction to break the data dependency in the encoding process, achieving higher throughput, lower latency, and better hardware utilization.

Next, to reduce the power consumption and computation latency of LDPC decoders, we propose a node operation reduction method and intelligent dynamic scheduling strategies. The proposed operation reduction method consists of an adaptive stopping criterion and a node deactivation mechanism that judges node reliability more accurately. In addition, we propose two dynamic decoding scheduling strategies. Compared with conventional scheduling methods, the first strategy selects the next message to update with a less greedy algorithm, which effectively lowers the error floor with fewer message computations. The second strategy uses a different metric when ordering and selecting the next message to update, taking considerations beyond the message residual into account. This scheduling strategy outperforms conventional scheduling methods in both achievable error rate and required number of message computations.

Furthermore, we design several low-complexity early stopping criteria for decoding. By exploiting the structural properties of dual-diagonal codes, the proposed early detection mechanism for successful decoding eliminates unnecessary decoding iterations and reduces the average number of iterations for decodable blocks without any loss in error correction performance. In addition, we propose two early termination mechanisms for unsuccessful decoding. The first utilizes the syndrome check function in the decoder to detect undecodable blocks; the second uses the hard decisions produced in consecutive iterations to track the decoding status and detect the convergence of unsuccessful decoding. Both mechanisms save decoding iterations effectively with little or no loss in error correction performance.

Finally, for applications with longer code lengths, we propose an implementation-friendly class of structured LDPC codes. A modified progressive edge-growth algorithm is used to construct the proposed hierarchical quasi-cyclic codes. By adding an implementation-friendly two-level structure to the parity check matrix, the error correction performance of quasi-cyclic codes can be improved with only a small number of second-level submatrices. Moreover, the decoder architecture for quasi-cyclic codes can be modified to decode the proposed structured codes, achieving better error rates and higher decoding speed.

Keywords: low-density parity-check codes, encoding, decoding, stopping criteria, dynamic scheduling, code structure.


Abstract

For error correction in communication systems, low-density parity-check (LDPC) codes have been shown to have near-Shannon-limit performance. However, compared with other error correction codes, encoding and decoding LDPC codes require considerable power and processing time, which limits their practical use. In this thesis, the following techniques are proposed for efficient LDPC coding systems: 1) low-latency encoding, 2) decoding with node operation reduction, intelligent scheduling, and early stopping criteria, and 3) code structure with implementation benefits.

First, an efficient encoding algorithm is proposed for dual-diagonal LDPC codes, which are adopted by many next generation communication standards. The proposed two-way parity bit correction encoding scheme breaks up the data dependency within the encoding process to achieve higher throughput, lower latency, and better hardware utilization.

Next, to reduce the power consumption and computation latency of LDPC decoders, a node operation reduction scheme and intelligent dynamic scheduling strategies are presented. The proposed operation reduction scheme consists of an adaptive stopping criterion and a node deactivation mechanism. The node deactivation mechanism improves the accuracy of node reliability estimation. On the other hand, two dynamic scheduling strategies for LDPC decoders are proposed. The first strategy improves the conventional scheduling algorithms by selecting the next message to update less greedily. The less-greedy scheduling effectively lowers the error floor with fewer message updates. The second strategy orders and selects the next message to update using a different metric with


considerations beyond the residuals of messages. This farsighted scheduling strategy outperforms the conventional scheduling algorithms in terms of achievable error rate and required number of message updates.

Furthermore, several low-complexity early stopping criteria for LDPC decoders are presented. An early detection mechanism for successful decoding is proposed to eliminate unnecessary iterations by exploiting the structure of dual-diagonal codes. The average number of decoding iterations for decodable blocks can be reduced without error performance degradation. On the other hand, two types of early termination mechanisms are proposed for unsuccessful decoding. In the first mechanism, the syndrome-check block in the decoder is utilized to detect undecodable blocks. In the second mechanism, the hard decisions made during consecutive iterations are used to monitor the decoding status and detect the convergence of unsuccessful decoding. These mechanisms can achieve significant iteration savings with little or no error performance loss.

Finally, we propose a class of implementation-friendly structured LDPC codes for long code length applications. A modified progressive edge-growth algorithm is used to construct the proposed hierarchical quasi-cyclic (H-QC) codes. By adding implementation-friendly two-level hierarchy with limited types of second-level submatrices in the parity check matrix, error performance is improved substantially over QC codes. We also show that QC-based decoder architecture can be easily applied to H-QC decoders to achieve better coding gain and higher throughput performance.

Keywords: low-density parity-check (LDPC) codes, encoding, decoding, stopping criteria, dynamic scheduling, code structure.


Contents

誌謝 ... i

摘要 ...iii

Abstract ... v

Contents ... vii

List of Figures ...xiii

List of Tables ... xvii

Chapter 1 Introduction ... 1

1.1 Overview of LDPC Codes ... 1

1.2 Encoding ... 3

1.3 Decoding ... 4

1.4 Code Structure ... 8

1.5 Contributions and Organizations of This Thesis ... 9

Chapter 2 Low-Latency Encoding Algorithm for Dual-Diagonal Codes Based on Two-Way Parity Bit Correction ... 11

2.1 Motivation ... 11


2.2 Dual-Diagonal LDPC Codes ... 13

2.3 Proposed Encoding Procedure ... 14

2.3.1 Encoding Concept ... 14

2.3.2 Proposed Encoding Scheme ... 16

2.3.3 Encoding Example ... 18

2.4 Proposed Encoder Architecture ... 19

2.4.1 Parallel Architecture ... 19

2.4.2 Serial Architecture ... 21

2.4.3 Analysis of Hardware Complexity and Encoding Latency ... 22

2.5 Implementation Results ... 24

2.5.1 Results of The Proposed Encoder Architecture ... 24

2.5.2 Results of Multi-Rate Encoder ... 27

2.5.3 Encoder Performance Comparison ... 28

2.6 Summary ... 32

Chapter 3 Decoding with Reduced Node Operations and Intelligent Scheduling ... 33

3.1 Node Operation Reduced Decoding ... 33


3.1.1 Motivation ... 33

3.1.2 Stopping Criterion with an Adaptive Threshold ... 35

3.1.3 Proposed Node Deactivation Technique ... 38

3.1.4 Performance ... 41

3.1.5 Summary ... 44

3.2 Less-Greedy and Farsighted Dynamic Scheduled Decoding ... 45

3.2.1 Existing Decoding Schedules ... 45

3.2.2 Proposed Dynamic Scheduling Strategies ... 46

3.2.3 Simulation Results ... 58

3.2.4 Complexity Analysis ... 63

3.2.5 Summary ... 64

Chapter 4 Stopping Criteria for Successful and Unsuccessful Decoding ... 65

4.1 Early Detection of Successful Decoding for Dual-Diagonal Codes ... 65

4.1.1 Motivation ... 65

4.1.2 Proposed Early Detection Mechanism for Successful Decoding ... 67

4.1.3 Simulation Results ... 68


4.1.4 Summary ... 70

4.2 Early Termination of Unsuccessful Decoding ... 71

4.2.1 Motivation ... 71

4.2.2 Proposed Syndrome-Based Early Termination ... 71

4.2.3 Proposed Hard-Decision-Based Early Termination ... 75

4.2.4 Simulation Results and Discussion ... 77

4.2.5 Complexity Analysis ... 87

4.2.6 Summary ... 90

Chapter 5 Design of Long Length Codes with Performance and Implementation Considerations ... 93

5.1 Hierarchical Quasi-Cyclic Codes ... 93

5.2 Code Construction ... 96

5.3 Decoder Implementation Issues ... 97

5.4 Simulation Results ... 101

5.5 Summary ... 103

Chapter 6 Conclusions and Future Work ... 105


6.2 Future Work ... 107

Bibliography ... 109


List of Figures

Fig. 1.1. Basic model of a digital communication system. ... 2

Fig. 1.2. Parity check matrix of a (2, 4)-regular LDPC code. ... 3

Fig. 1.3. Tanner graph corresponding to the parity check matrix in Fig. 1.2. ... 5

Fig. 1.4. Length-4 and length-6 cycles in H and its Tanner graph. ... 8

Fig. 2.1. Base matrix of rate 1/2 and N = 2304 IEEE 802.16e LDPC code. ... 14

Fig. 2.2. An example of the proposed encoding method. ... 18

Fig. 2.3. Proposed parallel encoder architecture. ... 19

Fig. 2.4. Encoding order for the proposed parallel encoder architecture. ... 20

Fig. 2.5. Proposed serial encoder architecture. ... 21

Fig. 2.6. Encoding order for the proposed serial encoder architecture. ... 22

Fig. 2.7. Area of the proposed parallel and serial architectures over different code lengths. ... 25

Fig. 2.8. Throughput of the proposed parallel and serial architectures over different code lengths. ... 26

Fig. 2.9. Throughput/area ratio over different code lengths. ... 30

Fig. 3.1. Performance of decoding the IEEE 802.16e (2304, 1152) LDPC code on the AWGN channel with a maximum of 50 iterations under the stopping criterion based on Nspc with fixed and adaptive thresholds: (a) Bit error rate and (b) Average number of iterations per block. ... 38

Fig. 3.2. Block diagram of the decoder adopting the proposed approach. ... 41

Fig. 3.3. Performance of decoding the IEEE 802.16e (2304, 1152) LDPC code on the AWGN channel with a maximum of 50 iterations under different operation-reduced techniques: (a) Bit error rate, (b) Average number of iterations per block, and (c) Average number of total node operations per block. ... 44

Fig. 3.4. BLER performance of message selection strategies S1, S2, S3, and S4. ... 54

Fig. 3.5. A dynamic schedule that breaks the trapping-set errors. ... 56

Fig. 3.6. Performance of the farsighted schedules compared with layered and schedules: (a) BLER and (b) average number of C2B message updates. ... 60

Fig. 3.7. Performance of proposed schedules compared with layered, RBP, and NWRBP schedules: (a) BLER and (b) average number of C2B message updates. ... 62

Fig. 3.8. Hard decision flipping rate (%) of scheduling strategies. ... 63

Fig. 4.1. Unequal error protection of irregular codes: BER of code bits associated with high degree and low degree BNs for IEEE 802.16e (2304, 1152) code on AWGN channel. ... 66

Fig. 4.2. H of rate 1/2 IEEE 802.16e code: the left part (systematic portion) is with degree 3 and 6 and the right part (parity portion) is mostly with degree 2. ... 67

Fig. 4.3. Performance of layered decoding algorithm: (a) Bit error rate (b) Average saved iterations (%) by the proposed mechanism. ... 70

Fig. 4.4. Percentage of undecodable blocks with Nz greater than the maximum, 3rd maximum, and 10th maximum Nz among all decodable blocks. ... 73

Fig. 4.5. Percentage of iterations to decode undecodable blocks such that the hard decisions keep the same. ... 75

Fig. 4.6. Performance of BP decoding for IEEE 802.16e (1152, 576) code with the proposed early termination mechanism A: (a) Average number of required decoding iterations (b) Block error rate. ... 79

Fig. 4.7. Performance of BP decoding for IEEE 802.16e (1152, 576) code with the proposed early termination mechanism B. ... 80

Fig. 4.8. Performance of BP decoding for IEEE 802.16e (1152, 576) code with the proposed early termination mechanism C: (a) Average number of required decoding iterations (b) Block error rate. ... 82

Fig. 4.9. Performance of BP decoding for IEEE 802.16e (1152, 576) code with the proposed early termination mechanism C and D with Imax = 3. ... 83

Fig. 4.10. Performance of BP decoding for the PEG (1024, 506) code with the proposed early termination mechanism A, mechanism D, and the methods in [27] and [39]: (a) Average number of required decoding iterations (b) Block error rate. ... 85

Fig. 4.11. Performance of BP decoding for the PEG (2048, 1018) code with the proposed early termination mechanism A, mechanism D, and the methods in [27] and [39]: (a) Average number of required decoding iterations (b) Block error rate. ... 87

Fig. 4.12. Block diagram of proposed early termination mechanism A and C. ... 89

Fig. 5.1. (j, k)-regular code matrix structure for (a) QC codes in [44][45][46][47][48] and (b) two-level H-QC codes. ... 95

Fig. 5.2. Submatrix construction of a two-level H-QC code with q1 = 5 and q2 = 3. (a) shows the initial empty submatrix. (b)–(e) shows the construction of the first two inner submatrices. Only the shaded regions are the possible positions for nonzero elements. (f) shows the completed result. ... 97

Fig. 5.3. Partially-parallel decoder architectures for (j, k)-regular QC codes in Fig. 5.1(a). ... 99

Fig. 5.4. Partially-parallel decoder architectures for (j, k)-regular H-QC codes in Fig. 5.1(b). ... 100

Fig. 5.5. Performance of LDPC codes on the AWGN channel using sum-product decoding algorithm with a maximum of 100 iterations. Rate=1/2, N=10080, 20160, 40320 H-QC codes are compared with QC and random codes with the same rates and lengths. ... 102

Fig. 5.6. Performance of LDPC codes on the AWGN channel using sum-product decoding algorithm with a maximum of 40 iterations. Rate=1/2, N=12288 H-QC codes with different number of inner submatrices are compared. ... 103


List of Tables

Table 2.1. Comparison between serial and parallel architectures. ... 23

Table 2.2. CPCs of the proposed two architectures. ... 24

Table 2.3. Area reduction (%) of the proposed serial architecture over parallel architecture. ... 27

Table 2.4. Synthesis results of the proposed multi-rate architecture for IEEE 802.16e standard. ... 28

Table 2.5. Synthesis results on FPGA as compared with the results in [21] and [19]... 31

Table 3.1. Message selection strategies. ... 49

Table 3.2. Complexity comparison of dynamic schedules. ... 64

Table 4.1. Complexity comparison. ... 89

Table 4.2. Implementation comparison. ... 90

Table 5.1. Decoder implementation results and comparison on Altera Stratix EP2S130 [50] FPGA device. ... 101


Chapter 1

Introduction

1.1 Overview of LDPC Codes

Error control coding (ECC) is a technique for reliable data transmission over noisy channels by appending redundancy to the data. This technique has become an essential component in modern digital communication systems to reduce the transmit power requirements, as illustrated in Fig. 1.1. Among the various error correcting codes, low-density parity-check (LDPC) codes, first proposed in the early 1960s by Gallager [1], are a class of linear block codes with sparse parity check matrices. This discovery was largely ignored for almost two decades until Tanner revisited LDPC codes in 1981 and provided a graphical representation of them [55]. In the late 1990s, MacKay and Neal showed that LDPC codes can achieve excellent error correcting performance very close to the Shannon limit [2]. Since then, LDPC codes have attracted tremendous attention and have been considered for use in practical communication systems and other digital applications such as recording media. In spite of their good error performance, the significant computational burden and processing latency of encoding and decoding LDPC codes remain serious obstacles to their practical use.

[Figure 1.1 shows the block chain: Data Source → Source Encoder → Channel Encoder (ECC) → Modulation → Noisy Channel → Demodulation → Channel Decoder → Source Decoder → Data Sink.]

Fig. 1.1. Basic model of a digital communication system.

Various classes of LDPC codes are currently adopted by the latest industrial standards, such as wireless LAN (IEEE 802.11n) [3], wireless MAN (IEEE 802.16e, WiMAX) [4], 10 Gigabit Ethernet (IEEE 802.3an) [5], wireless PAN (IEEE 802.15.3c, UWB) [6], Mobile Broadband Wireless Access (MBWA, IEEE 802.20) [7], satellite TV (DVB-S2) [8], and digital TV in China (DTTB) [9]. All of these codes have structured parity check matrices consisting of square submatrices, which facilitates efficient hardware implementation. Codes defined in IEEE 802.11n, IEEE 802.16e, and IEEE 802.20 have a dual-diagonal matrix structure for fast encoding. Codes defined in IEEE 802.3an are constructed based on Reed-Solomon (RS) codes [56]. RS-based LDPC codes have good minimum distances and error performance.

An (N, K) LDPC code with code length N and message length K is defined by an M×N sparse parity check matrix H such that any valid codeword c must satisfy

cH^T = 0. (1.1)

M, equal to N-K, denotes the number of parity bits. The N columns of H correspond to the N code bits of a codeword, and the M rows of H specify the M parity check constraints that the code bits must satisfy. A code is (j, k)-regular if each column of H contains exactly j nonzero elements and each row contains exactly k nonzero elements; otherwise it is an irregular code. Fig. 1.2 shows a (2, 4)-regular code with N = 8 and M = 4. In general, irregular codes with carefully chosen row and column weights perform better than regular codes [57]. Note that longer code lengths generally lead to better error correcting performance. The code lengths of LDPC codes employed by modern communication systems range from a few hundred to several thousand bits.

$$\mathbf{H} = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}$$

Fig. 1.2. Parity check matrix of a (2, 4)-regular LDPC code.
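To make the parity check condition (1.1) concrete, the following minimal sketch (Python/NumPy) verifies whether a word satisfies cH^T = 0 over GF(2) for the example matrix of Fig. 1.2; the test vectors are illustrative and not taken from the text.

```python
import numpy as np

# Parity check matrix of the (2, 4)-regular code in Fig. 1.2 (N = 8, M = 4).
H = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 1, 0, 0, 1]], dtype=np.uint8)

def is_codeword(c, H):
    """Return True if the syndrome cH^T is the all-zero vector over GF(2)."""
    syndrome = (H @ c) % 2          # evaluate the M parity check constraints mod 2
    return not syndrome.any()

c_zero = np.zeros(8, dtype=np.uint8)                          # all-zero word: always a codeword
c_bad = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=np.uint8)    # single-bit error

print(is_codeword(c_zero, H))   # True
print(is_codeword(c_bad, H))    # False: checks 0 and 1 are violated
```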

1.2 Encoding

Encoding of a linear block code is to uniquely map a K-bit source message s to an N-bit valid codeword c. The general encoding method is to find a K×N generator matrix G such that

GH^T = 0 (1.2)

and then c can be obtained by

c = sG. (1.3)

Systematic encoding, in which the K information bits of s appear directly in c, is desirable because s can be extracted immediately after decoding. For a systematic code with c = [s|p], where p is the M-bit parity block, the generator matrix must be

G = [I_K | P], (1.4)

where I_K is the K×K identity matrix and P is the parity portion of G. While H of an LDPC

code is very sparse, the corresponding G is always dense. The encoding complexity of this straightforward method is quadratic in the code length N. A huge number of additions and multiplications (XOR and AND operations for binary data) are required to complete the encoding when G is large. Practical encoding algorithms should avoid matrix operations on dense matrices while retaining systematic encoding. This can be achieved by preprocessing H into a special structure before encoding [14] or by placing special structural constraints on H when designing the code [58]. The class of block-type codes proposed in [58] and its subclass, the dual-diagonal codes adopted by IEEE standards [3][4][7], can be encoded in linear time by exploiting their matrix structure. The research focus is now on designing low-latency, hardware-efficient encoding algorithms for this kind of codes.
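As a point of reference for the complexity argument above, a minimal generator-matrix encoder might look as follows; the parity portion P used here is a small hypothetical example, not derived from any standard code.

```python
import numpy as np

K = 4                               # toy message length (hypothetical)
I_K = np.eye(K, dtype=np.uint8)
P = np.array([[1, 1, 0, 1],         # dense parity portion of G (illustrative only)
              [1, 0, 1, 1],
              [0, 1, 1, 1],
              [1, 1, 1, 0]], dtype=np.uint8)
G = np.hstack([I_K, P])             # systematic generator matrix G = [I_K | P]

def encode(s, G):
    """Systematic encoding c = sG over GF(2); c = [s | p]."""
    return (s @ G) % 2

s = np.array([1, 0, 1, 1], dtype=np.uint8)
c = encode(s, G)
print(c)   # first K bits reproduce s; the remaining bits are parity bits
# Cost: roughly K*N binary multiply-accumulates per codeword, i.e. quadratic in N
# when G is dense -- the motivation for structure-based linear-time encoders.
```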

1.3 Decoding

LDPC codes are usually decoded by a soft decision decoding algorithm based on iterative belief propagation (BP). H can be represented by a bipartite graph known as a Tanner graph [55]. As shown in Fig. 1.3, the Tanner graph has two disjoint sets of nodes, N bit nodes (BNs, also known as variable nodes) and M check nodes (CNs), which correspond to the N columns and M rows of H, respectively. A BN v_i is connected to a CN c_j by an edge if and only if there is a nonzero element in entry (j, i) of H, i.e., the code bit represented by v_i is contained in the parity check equation represented by c_j. Because the Tanner graph displays the incidence relationship between the code bits and the parity check equations that check on them, it can be used to study the iterative BP decoding algorithm.


Fig. 1.3. Tanner graph corresponding to the parity check matrix in Fig. 1.2.

The decoding algorithm processes the received soft values iteratively to improve the reliability of each decoded bit based on the Tanner graph. The computed reliability measures of decoded bits at the end of each iteration are used as input for the next iteration.

The hard decisions are made based on these computed reliability measures of decoded bits in each iteration. The decoding process continues until a valid codeword is found or other stopping criteria are satisfied.

The standard iterative decoding algorithm is called the sum-product algorithm (SPA). We summarize the log-domain SPA as follows [10]. L(c_i) and L(Q_i) are the log-likelihood ratios (LLRs) of the i-th bit of the received and corrected codeword, respectively. L(q_ij) is the LLR message from the BN v_i to the CN c_j, and L(r_ji) is the LLR message from the CN c_j to the BN v_i.

Log-domain Sum-Product Algorithm

Step 1. [Initialization]

$$L(q_{ij}) = L(c_i) = \begin{cases} +\infty, & y_i = 0 \\ -\infty, & y_i = 1 \\ 0, & y_i = E \end{cases}\ \text{(BEC)}, \qquad L(q_{ij}) = L(c_i) = (-1)^{y_i}\log\frac{1-\varepsilon}{\varepsilon}\ \text{(BSC)}, \qquad L(q_{ij}) = L(c_i) = \frac{2y_i}{\sigma^2}\ \text{(BI-AWGNC)} \tag{1.5}$$

for all i, j for which H_{ij} = 1, where y_i is the received signal.

Step 2. [CN Operation]

$$L(r_{ji}) = \Bigg(\prod_{i' \in V_j \setminus i} \alpha_{i'j}\Bigg) \cdot \phi\Bigg(\sum_{i' \in V_j \setminus i} \phi(\beta_{i'j})\Bigg) \tag{1.6}$$

where $\alpha_{ij}$ and $\beta_{ij}$ denote the sign and magnitude of $L(q_{ij})$, and

$$\phi(x) = -\log \tanh(x/2) = \log\frac{e^x + 1}{e^x - 1}. \tag{1.7}$$

V_j is the set of the BNs connected to the CN c_j.

Step 3. [BN Operation]

$$L(q_{ij}) = L(c_i) + \sum_{j' \in C_i \setminus j} L(r_{j'i}) \tag{1.8}$$

$$L(Q_i) = L(c_i) + \sum_{j \in C_i} L(r_{ji}) \tag{1.9}$$

C_i is the set of the CNs connected to the BN v_i.

Step 4. [Hard Decision Making and Stopping Condition Testing]

$$\hat{c}_i = \begin{cases} 1, & L(Q_i) < 0 \\ 0, & \text{otherwise} \end{cases} \qquad \text{for } i = 0, 1, \ldots, N-1.$$

If $\hat{\mathbf{c}}\mathbf{H}^T = \mathbf{0}$ or the number of iterations equals the maximum limit, stop; otherwise go to Step 2.
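For illustration only, a compact software model of the flooding-schedule log-domain SPA on a BI-AWGN channel is sketched below, using the small code of Fig. 1.2; it follows Steps 1-4 above but is not the decoder architecture considered later in this thesis.

```python
import numpy as np

def phi(x):
    """phi(x) = -log tanh(x/2); clip to avoid overflow near x = 0."""
    x = np.clip(x, 1e-12, 50.0)
    return -np.log(np.tanh(x / 2.0))

def spa_decode(H, y, sigma2, max_iter=50):
    """Log-domain SPA over a BI-AWGN channel; BPSK mapping 0 -> +1, 1 -> -1."""
    M, N = H.shape
    Lc = 2.0 * y / sigma2                 # channel LLRs L(c_i), Eq. (1.5)
    Lq = H * Lc                           # BN-to-CN messages L(q_ij), initialized to L(c_i)
    c_hat = np.zeros(N, dtype=np.uint8)
    for _ in range(max_iter):
        # CN operation (Eqs. 1.6-1.7): signs and magnitudes processed separately
        Lr = np.zeros((M, N))
        for j in range(M):
            idx = np.nonzero(H[j])[0]
            for k, i in enumerate(idx):
                others = np.delete(Lq[j, idx], k)
                sign = np.prod(np.sign(others))
                Lr[j, i] = sign * phi(np.sum(phi(np.abs(others))))
        # BN operation and a-posteriori LLRs (Eqs. 1.8-1.9)
        LQ = Lc + Lr.sum(axis=0)
        Lq = H * (LQ - Lr)                # exclude each check's own incoming message
        # Hard decision and syndrome check (Step 4)
        c_hat = (LQ < 0).astype(np.uint8)
        if not ((H @ c_hat) % 2).any():
            return c_hat, True
    return c_hat, False

H = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 1, 0, 0, 1]], dtype=np.uint8)
sigma2 = 0.5
rng = np.random.default_rng(0)
y = 1.0 + np.sqrt(sigma2) * rng.standard_normal(8)   # noisy all-zero codeword (+1 symbols)
c_hat, ok = spa_decode(H, y, sigma2)
print(c_hat, ok)
```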

The decoding process requires a number of iterations to converge to a valid codeword or declare a decoding failure. In every iteration, a huge number of messages are passed between BNs and CNs, with complex operations performed at each node. Thus the iterative decoder consumes considerable power and needs a long latency to decode one codeword.

Reducing both the number of iterations required for one codeword and the operations required within one iteration can minimize the decoder energy consumption and processing time. In addition, the decoding schedule, i.e., the processing order of BNs and CNs, can greatly affect the convergence speed of the decoder. How to alter the standard decoding schedule presented above to achieve better convergence speed with no error performance degradation is another challenge.

An important issue regarding the performance of LDPC codes is the error floor phenomenon. For conventional error correcting codes such as Reed-Solomon and convolutional codes, the error performance curve continuously decreases as the signal-to-noise ratio (SNR) becomes higher. However, for LDPC codes with iterative decoding, when the SNR is higher than a certain value, the performance curve does not decrease as quickly as at lower SNRs. This segment of the curve is the error floor, and the corresponding SNR region is referred to as the error floor region. The error floor is undesirable for practical communication systems and should be removed or lowered. The error floor performance is found to be dominated by certain graph structures such as trapping sets [13]. With this knowledge, enhanced decoding approaches such as better decoding scheduling strategies can be designed to break these bad structures and thus lower the error floor.

1.4 Code Structure

The error correcting performance of the code is directly related to the structure of H (or its Tanner graph). The structure of Tanner graphs has been extensively analyzed in the past to find codes with good performance. Graph conditioning techniques have been developed to prevent the structures that limit decoding performance. Among those graph structures, cycles are relatively easy to analyze and control. A length-l cycle in a Tanner graph is a closed path composed of l edges, as shown in Fig. 1.4. Short cycles prevent iterative decoding from approaching optimum (maximum-likelihood) decoding performance. The girth of a code is the length of the shortest cycle in its Tanner graph. Some code construction algorithms are designed to construct H with a larger girth and better cycle structure [11][12]. In order to lower the error floor, one can also try to construct H with better trapping-set properties.

However, these structures are more complex and make code construction more difficult.


Fig. 1.4. Length-4 and length-6 cycles in H and its Tanner graph.
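Length-4 cycles in particular are easy to detect: two columns of H that share nonzero entries in two common rows close a cycle of length 4. A small sketch of such a check (the helper name is hypothetical) is shown below.

```python
import numpy as np

def has_length4_cycle(H):
    """Return True if any two columns of H overlap in two or more rows,
    i.e. the Tanner graph contains a length-4 cycle."""
    H = np.asarray(H, dtype=np.int32)
    overlap = H.T @ H                 # overlap[a, b] = number of rows where columns a and b both have a 1
    np.fill_diagonal(overlap, 0)      # ignore each column's overlap with itself
    return bool((overlap >= 2).any())

H = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 1, 0, 0, 1]])
print(has_length4_cycle(H))   # True: e.g. columns 3 and 7 share rows 1 and 3 (cf. Fig. 1.4)
```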

In addition, the structure of H may affect the implementation complexity and throughput of the corresponding encoder and decoder. In particular, for H without any structural regularity, the decoder always requires high hardware cost for message routing and memory access. Some encoder- and decoder-oriented codes have been proposed in the past; by exploiting their matrix structures, linear-time encoding and low-cost partially-parallel decoding can be realized. However, how to jointly optimize both error correcting performance and encoder/decoder efficiency is still a challenging problem when designing the code structure.

1.5 Contributions and Organizations of This Thesis

From the above discussions, we know that when designing an LDPC coding system, issues in three different areas need to be addressed: encoding, decoding, and code structure. In this thesis, we propose several approaches in these three areas to reduce the power consumption or processing time of LDPC coding systems while maintaining good error correcting performance. In Chapter 2, we propose a low-latency encoding algorithm for dual-diagonal LDPC codes, which are widely adopted by many next generation communication standards. For LDPC decoders, which are usually more complex and consume more power than encoders, we first propose techniques to reduce the overall decoding operations in Chapter 3. Then intelligent scheduling strategies are presented to improve the convergence speed and error performance. In Chapter 4, low-complexity early detection of successful decoding (for dual-diagonal codes) and early termination of unsuccessful decoding are proposed to save the required decoding iterations. Finally, in Chapter 5, we design a new class of long-length structured LDPC codes with considerations of both error performance and decoder implementation complexity. Chapter 6 concludes this thesis and presents possible future work.

Chapter 2

Low-Latency Encoding Algorithm for

Dual-Diagonal Codes Based on Two-Way Parity Bit Correction

2.1 Motivation

One drawback of LDPC codes is their high encoding complexity resulting from dense generator matrices. The use of generator matrices in the encoding process can be avoided by employing a dual-diagonal matrix structure. LDPC codes with dual-diagonal structure are adopted by the IEEE 802.11n [3], IEEE 802.16e [4], and IEEE 802.20 [7] standards. This class of codes can be encoded in near-linear time by Richardson and Urbanke's (RU) method [14] and in linear time by the sequential method in [15]. RU-based encoder designs [16][17][18][19] reduce the encoding complexity by multiplying with mostly sparse matrices and relatively small dense matrices. On the other hand, sequential-algorithm-based designs [20][21] involve only operations with sparse matrices. The encoders in [18][19][20][21] are customized for IEEE 802.11n or IEEE 802.16e LDPC codes. The works in [21] and [19] achieve the highest encoding throughput but also require a long encoding latency. Moreover, due to the data dependency of the sequential encoding

algorithm, the hardware resources cannot be shared efficiently for codes with different code rates and code lengths. Note that the encoding throughput can be increased simply by interleaving multiple encoder instances [16][19]. Thus the most important metric for evaluating encoder efficiency is the throughput/area ratio. The dual-diagonal matrix structure is also exploited in the arbitrary bit generation and correction encoding algorithm [22][23]. This approach achieves low encoding complexity and reduces the encoding latency. However, it places a special restriction on the matrix structure that is incompatible with the IEEE 802.11n and IEEE 802.16e standards. From the results in [22], the matrix modification causes error correcting performance degradation with higher error floors compared to the original IEEE 802.11n codes.

In this chapter, we propose a generalized two-way prediction and correction based encoding scheme. The proposed scheme places no limitation on the dual-diagonal matrix structure. Our algorithm can be directly applied to encode IEEE 802.11n and IEEE 802.16e LDPC codes. The encoding latency is lowered thanks to the reduced data dependency in our algorithm. Both serial and parallel architectures are implemented on FPGA to demonstrate the improvement in throughput and throughput/area ratio. A multi-rate IEEE 802.16e encoder is also implemented to show the efficiency of hardware sharing. The remainder of this chapter is organized as follows. Section 2.2 introduces dual-diagonal LDPC codes. Section 2.3 presents the proposed encoding algorithm and compares it with other methods. The encoder architecture for the proposed algorithm is described and analyzed in Section 2.4.

Section 2.5 shows the hardware implementation results and comparisons with related works.

Finally, Section 2.6 summarizes this chapter.

2.2 Dual-Diagonal LDPC Codes

The dual-diagonal parity check matrix H of size M×N in IEEE 802.11n and 802.16e is defined as

$$\mathbf{H} = \left[\,(\mathbf{H}_s)_{M \times K} \;\middle|\; (\mathbf{H}_p)_{M \times M}\,\right] = \begin{bmatrix} P_{0,0} & P_{0,1} & P_{0,2} & \cdots & P_{0,n_b-1} \\ P_{1,0} & P_{1,1} & P_{1,2} & \cdots & P_{1,n_b-1} \\ P_{2,0} & P_{2,1} & P_{2,2} & \cdots & P_{2,n_b-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ P_{m_b-1,0} & P_{m_b-1,1} & P_{m_b-1,2} & \cdots & P_{m_b-1,n_b-1} \end{bmatrix} \tag{2.1}$$

where Hs corresponds to the information bits and Hp corresponds to the parity bits. Pi,j is either a circulant permutation matrix or a zero matrix of size z. A circulant permutation matrix is formed by circularly shifting the rows of an identity matrix of size z to the right by certain locations. H is expanded from a base matrix Hb

$$\mathbf{H}_b = \left[\,(\mathbf{H}_b^s)_{m_b \times k_b} \;\middle|\; (\mathbf{H}_b^p)_{m_b \times m_b}\,\right] \tag{2.2}$$

where mb = M/z and kb = K/z. Each element in Hb is either a nonnegative number representing the shift of the corresponding circulant permutation matrix or -1 representing a zero matrix. The structure of Hbp is further defined as

$$\mathbf{H}_b^p = \left[\,(\mathbf{t})_{m_b \times 1} \;\middle|\; (\mathbf{h})_{m_b \times (m_b-1)}\,\right] = \begin{bmatrix} d & 0 & & & & & \\ & 0 & 0 & & & & \\ & & \ddots & \ddots & & & \\ 0 & & & 0 & 0 & & \\ & & & & \ddots & \ddots & \\ & & & & & 0 & 0 \\ d & & & & & & 0 \end{bmatrix} \tag{2.3}$$

where h represents the dual-diagonal portion. Note that d is a positive number and all blank entries correspond to zero submatrices. By exploiting this structure, the codeword can be encoded recursively in linear time and the encoder complexity can be reduced significantly.

However, the data dependency in the process increases the number of clock cycles needed to encode a codeword.


The LDPC code lengths (N) in IEEE 802.11n are 648, 1296, and 1944 with sub-block size z = 27, 54, 81 respectively. The code lengths in IEEE 802.16e range from 576 to 2304 with z = 24 to 96. Both standards support code rates of 1/2, 2/3, 3/4, and 5/6. Fig. 2.1 shows a sample code matrix in IEEE 802.16e with rate 1/2.

- 94 73 - - - - - 55 83 - - 7 0 - - - - - - - - - -

- 27 - - - 22 79 9 - - - 12 - 0 0 - - - - - - - - -

- - - 24 22 81 - 33 - - - 0 - - 0 0 - - - - - - - -

61 - 47 - - - - - 65 25 - - - - - 0 0 - - - - - - -

- - 39 - - - 84 - - 41 72 - - - - - 0 0 - - - - - -

- - - - 46 40 - 82 - - - 79 0 - - - - 0 0 - - - - -

- - 95 53 - - - - - 14 18 - - - - - - - 0 0 - - - -

- 11 73 - - - 2 - - 47 - - - - - - - - - 0 0 - - -

12 - - - 83 24 - 43 - - - 51 - - - - - - - - 0 0 - -

- - - - - 94 - 59 - - 70 72 - - - - - - - - - 0 0 -

- - 7 65 - - - - 39 49 - - - - - - - - - - - - 0 0

43 - - - - 66 - 41 - - - 26 7 - - - - - - - - - - 0

Fig. 2.1. Base matrix of rate 1/2 and N = 2304 IEEE 802.16e LDPC code.
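To relate a base matrix such as Fig. 2.1 to the binary parity check matrix it defines, the following sketch expands each nonnegative entry into a z×z circulant permutation matrix (an identity with its rows circularly shifted to the right) and each -1 entry into a zero matrix; the small base matrix used here is illustrative only, not taken from a standard.

```python
import numpy as np

def expand_base_matrix(Hb, z):
    """Expand a base matrix: entry s >= 0 becomes the z x z identity with rows
    circularly right-shifted by s positions; entry -1 becomes the z x z zero matrix."""
    mb, nb = Hb.shape
    H = np.zeros((mb * z, nb * z), dtype=np.uint8)
    I = np.eye(z, dtype=np.uint8)
    for r in range(mb):
        for c in range(nb):
            s = Hb[r, c]
            if s >= 0:
                H[r*z:(r+1)*z, c*z:(c+1)*z] = np.roll(I, s % z, axis=1)
    return H

# Toy 2 x 4 base matrix (illustrative), sub-block size z = 4
Hb = np.array([[1, -1,  0,  2],
               [-1,  3,  2,  0]])
H = expand_base_matrix(Hb, z=4)
print(H.shape)   # (8, 16)
```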

2.3 Proposed Encoding Procedure

We denote the information block s = [a_0 a_1 … a_{K-1}] and the information sub-blocks s_i = [a_{iz} a_{iz+1} … a_{(i+1)z-1}] for i = 0, 1, …, kb-1. Similarly, let p = [b_0 b_1 … b_{M-1}] denote the parity block and p_i = [b_{iz} b_{iz+1} … b_{(i+1)z-1}] for i = 0, 1, …, mb-1 denote the parity sub-blocks. The prediction vector p_i' = [b'_{iz} b'_{iz+1} … b'_{(i+1)z-1}] is defined as the predicted solution of p_i. The row index of the nonnegative entry in the middle of the weight-3 column in Hbp is denoted by x. Note that all operations discussed in the following are modulo-2 operations.

2.3.1 Encoding Concept

By definition, a valid codeword c = [s|p] must satisfy the following equation


$$\mathbf{H}\mathbf{c}^t = \left[\,(\mathbf{H}_s)_{M \times K} \;\middle|\; (\mathbf{H}_p)_{M \times M}\,\right]\left[\,\mathbf{s} \;\middle|\; \mathbf{p}\,\right]^t = \mathbf{0}. \tag{2.4}$$

Replacing H_s and H_p by the dual-diagonal matrix definition in Section 2.2, we get

$$\begin{bmatrix} P_{0,0} & P_{0,1} & P_{0,2} & \cdots & P_{0,k_b-1} \\ P_{1,0} & P_{1,1} & P_{1,2} & \cdots & P_{1,k_b-1} \\ \vdots & \vdots & \vdots & & \vdots \\ P_{m_b-1,0} & P_{m_b-1,1} & P_{m_b-1,2} & \cdots & P_{m_b-1,k_b-1} \end{bmatrix}\begin{bmatrix} \mathbf{s}_0^t \\ \mathbf{s}_1^t \\ \vdots \\ \mathbf{s}_{k_b-1}^t \end{bmatrix} + \begin{bmatrix} I_d & I & & & & \\ & I & I & & & \\ & & \ddots & \ddots & & \\ I & & & I & I & \\ & & & & \ddots & \ddots \\ I_d & & & & & I \end{bmatrix}\begin{bmatrix} \mathbf{p}_0^t \\ \mathbf{p}_1^t \\ \vdots \\ \mathbf{p}_{m_b-1}^t \end{bmatrix} = \mathbf{0} \tag{2.5}$$

where I is the identity matrix and Id is the circulant permutation matrix with d-position row shifting to the right. After matrix multiplication, we obtain

$$\begin{bmatrix} (\mathbf{p}_0^t)^d + \mathbf{p}_1^t \\ \mathbf{p}_1^t + \mathbf{p}_2^t \\ \vdots \\ \mathbf{p}_0^t + \mathbf{p}_x^t + \mathbf{p}_{x+1}^t \\ \mathbf{p}_{x+1}^t + \mathbf{p}_{x+2}^t \\ \vdots \\ \mathbf{p}_{m_b-2}^t + \mathbf{p}_{m_b-1}^t \\ (\mathbf{p}_0^t)^d + \mathbf{p}_{m_b-1}^t \end{bmatrix} + \begin{bmatrix} \sum_{j=0}^{k_b-1} P_{0,j}\,\mathbf{s}_j^t \\ \sum_{j=0}^{k_b-1} P_{1,j}\,\mathbf{s}_j^t \\ \vdots \\ \sum_{j=0}^{k_b-1} P_{x,j}\,\mathbf{s}_j^t \\ \vdots \\ \sum_{j=0}^{k_b-1} P_{m_b-2,j}\,\mathbf{s}_j^t \\ \sum_{j=0}^{k_b-1} P_{m_b-1,j}\,\mathbf{s}_j^t \end{bmatrix} = \mathbf{0} \tag{2.6}$$

where (p_0)^d denotes p_0 circularly shifted d positions to the left. Summing all the rows in (2.6), we get

$$\mathbf{p}_0^t = \sum_{i=0}^{m_b-1} \lambda_i \tag{2.7}$$

where $\lambda_i = \sum_{j=0}^{k_b-1} P_{i,j}\,\mathbf{s}_j^t$. For sequential encoding, p_0 must be calculated first by a series of

shifting and accumulation operations as in (2.7). Then p_1 to p_{m_b-1} can be obtained by forward or backward substitution through the equations in (2.6). Due to the data dependency among the parity sub-blocks, the calculation of p_0 and the derivation of p_1 to p_{m_b-1} cannot be parallelized to save time.
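For comparison, a minimal software model of this conventional sequential procedure is sketched below; it assumes the λ_i sub-blocks have already been computed from H_s and s, and that x and d are the structural parameters defined earlier in this section.

```python
import numpy as np

def sequential_parity(lam, x, d):
    """Conventional sequential encoding: compute p0 by (2.7), then forward
    substitution through (2.6). lam is an mb x z array of lambda_i sub-blocks."""
    mb, z = lam.shape
    p = np.zeros((mb, z), dtype=np.uint8)
    p[0] = lam.sum(axis=0) % 2               # Eq. (2.7): p0 = sum of all lambda_i
    p[1] = lam[0] ^ np.roll(p[0], -d)        # first row of (2.6): uses (p0)^d
    for i in range(2, mb):                   # strictly serial: p_i depends on p_{i-1}
        extra = p[0] if i == x + 1 else 0    # row x of (2.6) also involves p0
        p[i] = lam[i - 1] ^ p[i - 1] ^ extra
    return p
```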

In our proposed approach, instead of calculating p0 first, we jump start the calculation by setting p0 as an arbitrary vector p0’ and immediately calculate the prediction vectors pi’s.

p_1' and p_{m_b-1}' can be obtained as p_1' = λ_0 + (p_0')^d and p_{m_b-1}' = λ_{m_b-1} + (p_0')^d, respectively. The other prediction vectors p_i' are then obtained by forward substitution

$$\mathbf{p}_i' = \lambda_{i-1} + \mathbf{p}_{i-1}' \tag{2.8}$$

for i = 2, 3, …, x and by backward substitution

$$\mathbf{p}_i' = \lambda_i + \mathbf{p}_{i+1}' \tag{2.9}$$

for i = m_b-2, m_b-3, …, x+1. After the backward substitution, one additional operation p_temp = λ_x + p_{x+1}' is needed, and p_temp will be used in the following computation. It is known that the relationship (p_0)^d + (p_0')^d = p_i + p_i' holds for i = 1, 2, …, m_b-1. We define the correction vector f as

$$\mathbf{f} = (\mathbf{p}_0)^d + (\mathbf{p}_0')^d = (\mathbf{p}_0 + \mathbf{p}_0')^d. \tag{2.10}$$

f is unknown at this point because p_0 is not yet available. However, it is known that

$$\mathbf{p}_x + \mathbf{p}_{x+1} = (\mathbf{p}_x' + \mathbf{f}) + (\mathbf{p}_{x+1}' + \mathbf{f}) = \mathbf{p}_x' + \mathbf{p}_{x+1}'. \tag{2.11}$$

Hence, from the x-th equation in (2.6), we can calculate p_0 by

$$\mathbf{p}_0 = \lambda_x + \mathbf{p}_x + \mathbf{p}_{x+1} = \lambda_x + \mathbf{p}_x' + \mathbf{p}_{x+1}' = \mathbf{p}_x' + \mathbf{p}_{\mathrm{temp}}. \tag{2.12}$$

Then f can be obtained from (2.10). If we set p_0' to the zero vector at first, then f is just (p_0)^d, the shifted version of p_0. Finally, the other parity sub-blocks can be obtained by

$$\mathbf{p}_i = \mathbf{p}_i' + \mathbf{f}. \tag{2.13}$$

2.3.2 Proposed Encoding Scheme

Based on the previous discussion, our proposed encoding algorithm is summarized as follows.

Proposed Scheme

Step 1. Set p_0' (i.e., b'_0, b'_1, …, b'_{z-1}) to any binary vector.

Step 2. Compute the vector λ = H_s s by circularly shifting and accumulating the sub-blocks of s. (We denote λ = [c_0 c_1 … c_{M-1}] and λ_i = [c_{iz} c_{iz+1} … c_{(i+1)z-1}] for i = 0, 1, …, m_b-1.)

Step 3. [Forward Derivation] Compute p_1', p_2', …, p_x'.

Step 4. [Backward Derivation] Compute p_{m_b-1}', p_{m_b-2}', …, p_{x+1}' and p_temp.

Step 5. Compute p_0 by adding p_x' and p_temp.

Step 6. Compute the correction vector f by circularly shifting the sum of p_0 and p_0' to the left by d positions.

Step 7. [Correction] If f is a nonzero vector, compute p_i by adding p_i' and f for i = 1 to m_b-1. Otherwise, p_i is simply p_i'.

Compared with the sequential algorithm, the proposed algorithm reduces the encoding latency in the following places. In Step 1, p_0' is set arbitrarily instead of being computed by the matrix operations in (2.7). Then, in Step 3 and Step 4, the parity sub-blocks p_1', p_2', …, p_{m_b-1}' can be obtained without knowing p_0. Step 5 and Step 6 compute p_0 and f from (2.12) and (2.10), respectively. In addition, there is no dependency between Step 3 and Step 4, so the forward and backward derivations can be executed simultaneously. Since the algorithm proposed in [22] can only generate these bits by forward substitution, our approach reduces the encoding delay further.
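A behavioural software model of the seven steps, working at the sub-block level under the assumptions noted in the comments (a dual-diagonal base matrix in the form of (2.1)-(2.3); the array and function names are ours, not part of the standards), is sketched below. It models only the algorithm; the hardware mapping is the subject of Section 2.4.

```python
import numpy as np

def circ_mul(v, s):
    """Multiply sub-block v by a circulant permutation matrix with shift s,
    which equals a circular left shift of v by s positions."""
    return np.roll(v, -s)

def two_way_encode(Hb_s, z, x, d, s_bits):
    """Two-way parity bit correction encoding (behavioural model).
    Hb_s  : mb x kb base matrix of the systematic part (-1 = zero submatrix)
    z     : sub-block size;  x : row of the middle entry of the weight-3 column
    d     : shift of the top/bottom entries of that column
    s_bits: information block of kb*z bits."""
    mb, kb = Hb_s.shape
    s = s_bits.reshape(kb, z)

    # Step 1: arbitrary initial prediction p0' (the zero vector, as in the encoder design)
    p0p = np.zeros(z, dtype=np.uint8)

    # Step 2: lambda_i = sum_j P_{i,j} s_j (shift-and-accumulate, cf. Eq. 2.7)
    lam = np.zeros((mb, z), dtype=np.uint8)
    for i in range(mb):
        for j in range(kb):
            if Hb_s[i, j] >= 0:
                lam[i] ^= circ_mul(s[j], Hb_s[i, j])

    pp = np.zeros((mb, z), dtype=np.uint8)        # prediction vectors p_i'
    # Step 3: forward derivation p_1', ..., p_x'
    pp[1] = lam[0] ^ circ_mul(p0p, d)
    for i in range(2, x + 1):
        pp[i] = lam[i - 1] ^ pp[i - 1]
    # Step 4: backward derivation p_{mb-1}', ..., p_{x+1}' and p_temp
    pp[mb - 1] = lam[mb - 1] ^ circ_mul(p0p, d)
    for i in range(mb - 2, x, -1):
        pp[i] = lam[i] ^ pp[i + 1]
    p_temp = lam[x] ^ pp[x + 1]

    # Step 5: p_0 = p_x' + p_temp  (Eq. 2.12)
    p = np.zeros((mb, z), dtype=np.uint8)
    p[0] = pp[x] ^ p_temp
    # Step 6: correction vector f = (p_0 + p_0')^d  (Eq. 2.10)
    f = circ_mul(p[0] ^ p0p, d)
    # Step 7: correct the predictions
    for i in range(1, mb):
        p[i] = pp[i] ^ f
    return np.concatenate([s_bits, p.reshape(-1)])
```

Note that Steps 3 and 4 touch disjoint prediction vectors, which is what allows the forward and backward derivations to run concurrently in hardware.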


2.3.3 Encoding Example

An encoding example is illustrated in Fig. 2.2. The submatrix size z in this example is 4. At first, b'_0, b'_1, b'_2, b'_3 are set to zeros (Step 1). The value λ = H_s s is calculated in Step 2. With these four bits, we can obtain b'_4, b'_5, …, b'_15 (Step 3) and b'_23, b'_22, …, b'_16 (Step 4) by forward and backward derivation, respectively. p_temp = [d_0 d_1 d_2 d_3] can be calculated using b'_16, b'_17, b'_18, b'_19 and c_12, c_13, c_14, c_15. After that, we can easily find p_0 by adding p_3' and p_temp (Step 5). Then the correction vector f is the sum of p_0 and p_0' with a one-position left circular shift (Step 6). Finally, f is added to p_1', p_2', p_3', p_4', p_5' to generate the other parity bits b_4, b_5, …, b_23 (Step 7). The final solution obtained is p = [1 1 0 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1].


2.4 Proposed Encoder Architecture

In this section, we present a parallel and a serial architecture based on the proposed encoding algorithm. The hardware complexity and throughput of both architectures are also compared.

2.4.1 Parallel Architecture

Fig. 2.3 shows the block diagram of the proposed parallel architecture. Our architecture takes advantage of the two-way prediction to achieve higher level parallelism.

The encoder architecture is composed of the following four stages. Note that the initial prediction vector p0’ is set to zero to simplify the hardware.

Fig. 2.3. Proposed parallel encoder architecture.

This first stage carries out the multiplication of submatrices in Hs and information sub-blocks (i.e., si) by multiplexers and barrel shifters (part of Step 2 in the proposed scheme). As shown in Fig. 2.4, the multiplication proceeds in a two-way fashion from


topmost and bottommost block rows of Hs simultaneously to reduce the encoding latency.

Conventional sequential encoder architectures such as [15] use kb barrel shifters for parallel computation of H_s s; kb is 12 for the IEEE 802.16e (2304, 1152) code. Nevertheless, in every computation cycle, certain barrel shifters will be idle due to the existence of zero submatrices in H_s. To minimize the number of idle barrel shifters, our proposed architecture uses 2×α barrel shifters to process two block rows simultaneously, where α is the maximum number of nonzero submatrices in one block row of H_s. Multiplexers are used to select the information sub-blocks corresponding to nonzero submatrices. In later stages, the forward derivation corresponding to the upper part of H_s and the backward derivation corresponding to the lower part of H_s can also be performed in parallel. For the 802.16e (2304, 1152) code, only 10 barrel shifters are needed to achieve two-way computation.


Fig. 2.4. Encoding order for the proposed parallel encoder architecture.
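The two-way organization can be summarized in a small scheduling sketch (hypothetical helper, Python): it extracts α from the systematic part of the base matrix and pairs the topmost and bottommost unprocessed block rows in each step. For the rate-1/2 IEEE 802.16e base matrix of Fig. 2.1 this gives α = 5, consistent with the 2×α = 10 barrel shifters mentioned above.

```python
import numpy as np

def two_way_schedule(Hb_s):
    """Return alpha (max nonzero submatrices per block row of Hs) and the order in
    which block rows are processed two at a time, from the top and the bottom."""
    mb = Hb_s.shape[0]
    alpha = int((Hb_s >= 0).sum(axis=1).max())
    pairs = [(t, mb - 1 - t) for t in range(mb // 2)]
    if mb % 2:
        pairs.append((mb // 2,))      # odd number of block rows: middle row processed alone
    return alpha, pairs
```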

In the second stage, 2×α z-bit XOR gates are used for the two-way computation of the λ_i (part of Step 2) and the prediction parity vectors p_i' (Step 3 and Step 4). Each p_i' is computed as soon as the required λ_i is available. The computation of the λ_i and p_i' is pipelined to reduce the encoding latency. In the meantime, p_0 and the correction vector f = (p_0)^d are obtained by one XOR operation (Step 5 and Step 6). In the last stage, the p_i' are corrected by f via XOR gates (Step 7). There is no data dependency among these correction operations, so the p_i can be computed in parallel.

2.4.2 Serial Architecture

We also propose a serial encoder architecture for low throughput applications. Fig. 2.5 shows the block diagram of the proposed architecture. Encoding operations are scheduled to reduce idling hardware. The architecture is composed of the following three stages.

Fig. 2.5. Proposed serial encoder architecture.

The first stage employs two barrel shifters for the matrix multiplication, a reduction by a factor of α compared with the parallel architecture. The information sub-blocks corresponding to the upper and lower parts of H_s are serially selected and processed by the two barrel shifters, respectively. Fig. 2.6 shows the processing order for the multiplication operations. All zero submatrices are skipped, so at most α cycles are required for each block row instead of kb cycles.
