## National Taiwan University Doctoral Dissertation

Department of Computer Science and Information Engineering
College of Electrical Engineering and Computer Science
National Taiwan University

## Design of Low-Density Parity-Check Coding Systems

## 低密度同位檢查碼系統設計

## Chia-Yu Lin (林家瑜)

## Advisor: Mong-Kai Ku, Ph.D. (顧孟愷 博士)

## March, 2011 (中華民國 100 年 3 月)


**Acknowledgments**

First and foremost, I thank my advisor, Professor 顧孟愷, under whose care I have grown in many ways over these years; his teaching and advice have always been inspiring and greatly beneficial. I thank my oral defense committee members, Professors 賴飛羆, 吳安宇, 楊佳玲, 洪士灝, and 廖俊睿, for the many suggestions that improved this dissertation. I am grateful to the many senior, fellow, and junior students of our laboratory, with whom I shared many precious moments here. I thank my good friend 宇翔 for sacrificing his rest time to help greatly during the preparation of my oral defense; I will always remember it. I thank my girlfriend 劉暢 for her constant reminders and encouragement, and for her many suggestions on the writing of this dissertation. I thank all the friends and classmates who have helped, cared for, and encouraged me. Finally, I thank my parents and family for their unconditional support during this period, which allowed me to complete my studies free of worries.

**摘要 (Chinese Abstract)**

Low-density parity-check (LDPC) codes achieve error correction performance close to the Shannon theoretical limit. However, compared with other error correction codes, encoding and decoding LDPC codes usually require more power and processing time, which limits their practical applications. In this dissertation, we propose the following techniques for designing efficient LDPC coding systems: 1) a low-latency encoding method; 2) several decoding techniques, including node operation reduction, intelligent scheduling, and early stopping criteria; and 3) code structures with implementation considerations.

First, we propose an efficient encoding algorithm for the dual-diagonal LDPC codes adopted by many of the latest communication standards. The proposed encoding method uses two-way parity bit correction to break the data dependency in the encoding process, achieving higher throughput, lower latency, and better hardware utilization.

Next, to reduce the power consumption and computation latency of LDPC decoders, we propose a node operation reduction method and intelligent dynamic scheduling strategies. The proposed operation reduction method consists of an adaptive stopping criterion and a node deactivation mechanism; the node deactivation mechanism judges node reliability more accurately. In addition, we propose two dynamic decoding scheduling strategies. Compared with conventional scheduling methods, the first strategy selects the next message to update with a less greedy algorithm, which effectively lowers the error floor with fewer message computations. The second strategy orders and selects the next message to update with a different metric that takes considerations beyond message residuals into account. This scheduling strategy outperforms conventional methods in both achievable error rate and required number of message computations.

Then, we design several low-complexity early stopping criteria for decoding. Exploiting the structural property of dual-diagonal codes, the proposed early detection mechanism for successful decoding eliminates unnecessary decoding iterations. It reduces the average number of decoding iterations for decodable blocks without sacrificing error correction performance. In addition, we propose two early termination mechanisms for decoding failures. The first uses the syndrome-check function in the decoder to detect undecodable blocks. The second uses the hard decisions produced in consecutive decoding iterations to track the decoding status and detect the convergence of unsuccessful decoding. These mechanisms effectively save decoding iterations with little or no loss in error correction performance.

Finally, for applications with longer code lengths, we propose a class of implementation-friendly structured LDPC codes. We modify the progressive edge-growth algorithm to construct the proposed hierarchical quasi-cyclic codes. By adding an implementation-friendly two-level structure to the parity check matrix, the error correction performance of quasi-cyclic codes can be improved with only a small number of second-level submatrices. Moreover, the decoder architecture for quasi-cyclic codes can be modified to decode the proposed structured codes, achieving better error rates and faster decoding.

Keywords: low-density parity-check codes, encoding, decoding, stopping criteria, dynamic scheduling, code structure.

**Abstract **

For error correction in communication systems, low-density parity-check (LDPC) codes have been shown to have near-Shannon-limit performance. However, compared with other error correction codes, encoding and decoding LDPC codes require considerable power and processing time, which limits their practical use. In this thesis, the following techniques are proposed for efficient LDPC coding systems: 1) low-latency encoding, 2) decoding with node operation reduction, intelligent scheduling, and early stopping criteria, and 3) code structures with implementation benefits.

First, an efficient encoding algorithm is proposed for dual-diagonal LDPC codes, which are adopted by many next generation communication standards. The proposed two-way parity bit correction encoding scheme breaks up the data dependency within the encoding process to achieve higher throughput, lower latency, and better hardware utilization.

Next, to reduce the power consumption and computation latency of LDPC decoders, a node operation reduction scheme and intelligent dynamic scheduling strategies are presented. The proposed operation reduction scheme consists of an adaptive stopping criterion and a node deactivation mechanism; the node deactivation mechanism improves the accuracy of node reliability estimation. In addition, two dynamic scheduling strategies for LDPC decoders are proposed. The first strategy improves on conventional scheduling algorithms by selecting the next message to update less greedily; this less-greedy scheduling effectively lowers the error floor with fewer message updates. The second strategy orders and selects the next message to update using a different metric, with considerations beyond the residuals of messages. This farsighted scheduling strategy outperforms the conventional scheduling algorithms in terms of achievable error rate and required number of message updates.

Furthermore, several low-complexity early stopping criteria for LDPC decoders are presented. An early detection mechanism for successful decoding is proposed to eliminate unnecessary iterations by exploiting the structure of dual-diagonal codes. The average number of decoding iterations for decodable blocks can thus be reduced without error performance degradation. In addition, two types of early termination mechanisms are proposed for unsuccessful decoding. In the first mechanism, the syndrome-check block in the decoder is utilized to detect undecodable blocks. In the second mechanism, the hard decisions made during consecutive iterations are used to monitor the decoding status and detect the convergence of unsuccessful decoding. These mechanisms achieve significant iteration savings with little or no error performance loss.

Finally, we propose a class of implementation-friendly structured LDPC codes for long code length applications. A modified progressive edge-growth algorithm is used to construct the proposed hierarchical quasi-cyclic (H-QC) codes. By adding implementation-friendly two-level hierarchy with limited types of second-level submatrices in the parity check matrix, error performance is improved substantially over QC codes. We also show that QC-based decoder architecture can be easily applied to H-QC decoders to achieve better coding gain and higher throughput performance.

**Keywords:** low-density parity-check (LDPC) codes, encoding, decoding, stopping criteria, dynamic scheduling, code structure.

**Contents**

誌謝 (Acknowledgments)

摘要 (Chinese Abstract)

**Abstract**

**Contents**

**List of Figures**

**List of Tables**

**Chapter 1 Introduction**
1.1 Overview of LDPC Codes
1.2 Encoding
1.3 Decoding
1.4 Code Structure
1.5 Contributions and Organizations of This Thesis

**Chapter 2 Low-Latency Encoding Algorithm for Dual-Diagonal Codes Based on Two-Way Parity Bit Correction**
2.1 Motivation
2.2 Dual-Diagonal LDPC Codes
2.3 Proposed Encoding Procedure
2.3.1 Encoding Concept
2.3.2 Proposed Encoding Scheme
2.3.3 Encoding Example
2.4 Proposed Encoder Architecture
2.4.1 Parallel Architecture
2.4.2 Serial Architecture
2.4.3 Analysis of Hardware Complexity and Encoding Latency
2.5 Implementation Results
2.5.1 Results of the Proposed Encoder Architecture
2.5.2 Results of Multi-Rate Encoder
2.5.3 Encoder Performance Comparison
2.6 Summary

**Chapter 3 Decoding with Reduced Node Operations and Intelligent Scheduling**
3.1 Node Operation Reduced Decoding
3.1.1 Motivation
3.1.2 Stopping Criterion with an Adaptive Threshold
3.1.3 Proposed Node Deactivation Technique
3.1.4 Performance
3.1.5 Summary
3.2 Less-Greedy and Farsighted Dynamic Scheduled Decoding
3.2.1 Existing Decoding Schedules
3.2.2 Proposed Dynamic Scheduling Strategies
3.2.3 Simulation Results
3.2.4 Complexity Analysis
3.2.5 Summary

**Chapter 4 Stopping Criteria for Successful and Unsuccessful Decoding**
4.1 Early Detection of Successful Decoding for Dual-Diagonal Codes
4.1.1 Motivation
4.1.2 Proposed Early Detection Mechanism for Successful Decoding
4.1.3 Simulation Results
4.1.4 Summary
4.2 Early Termination of Unsuccessful Decoding
4.2.1 Motivation
4.2.2 Proposed Syndrome-Based Early Termination
4.2.3 Proposed Hard-Decision-Based Early Termination
4.2.4 Simulation Results and Discussion
4.2.5 Complexity Analysis
4.2.6 Summary

**Chapter 5 Design of Long Length Codes with Performance and Implementation Considerations**
5.1 Hierarchical Quasi-Cyclic Codes
5.2 Code Construction
5.3 Decoder Implementation Issues
5.4 Simulation Results
5.5 Summary

**Chapter 6 Conclusions and Future Work**
6.2 Future Work

**Bibliography**

**List of Figures**

Fig. 1.1. Basic model of a digital communication system.
Fig. 1.2. Parity check matrix of a (2, 4)-regular LDPC code.
Fig. 1.3. Tanner graph corresponding to the parity check matrix in Fig. 1.2.
Fig. 1.4. Length-4 and length-6 cycles in H and its Tanner graph.
Fig. 2.1. Base matrix of rate 1/2 and N = 2304 IEEE 802.16e LDPC code.
Fig. 2.2. An example of the proposed encoding method.
Fig. 2.3. Proposed parallel encoder architecture.
Fig. 2.4. Encoding order for the proposed parallel encoder architecture.
Fig. 2.5. Proposed serial encoder architecture.
Fig. 2.6. Encoding order for the proposed serial encoder architecture.
Fig. 2.7. Area of the proposed parallel and serial architectures over different code lengths.
Fig. 2.8. Throughput of the proposed parallel and serial architectures over different code lengths.
Fig. 2.9. Throughput/area ratio over different code lengths.
Fig. 3.1. Performance of decoding the IEEE 802.16e (2304, 1152) LDPC code on the AWGN channel with a maximum of 50 iterations under the stopping criterion based on Nspc with fixed and adaptive thresholds: (a) bit error rate and (b) average number of iterations per block.
Fig. 3.2. Block diagram of the decoder adopting the proposed approach.
Fig. 3.3. Performance of decoding the IEEE 802.16e (2304, 1152) LDPC code on the AWGN channel with a maximum of 50 iterations under different operation-reduced techniques: (a) bit error rate, (b) average number of iterations per block, and (c) average number of total node operations per block.
Fig. 3.4. BLER performance of message selection strategies S1, S2, S3, and S4.
Fig. 3.5. A dynamic schedule that breaks the trapping-set errors.
Fig. 3.6. Performance of the farsighted schedules compared with layered and schedules: (a) BLER and (b) average number of C2B message updates.
Fig. 3.7. Performance of proposed schedules compared with layered, RBP, and NWRBP schedules: (a) BLER and (b) average number of C2B message updates.
Fig. 3.8. Hard decision flipping rate (%) of scheduling strategies.
Fig. 4.1. Unequal error protection of irregular codes: BER of code bits associated with high-degree and low-degree BNs for the IEEE 802.16e (2304, 1152) code on the AWGN channel.
Fig. 4.2. H of the rate 1/2 IEEE 802.16e code: the left part (systematic portion) has degrees 3 and 6 and the right part (parity portion) has mostly degree 2.
Fig. 4.3. Performance of the layered decoding algorithm: (a) bit error rate and (b) average saved iterations (%) by the proposed mechanism.
Fig. 4.4. Percentage of undecodable blocks with N^z greater than the maximum, 3rd maximum, and 10th maximum N^z among all decodable blocks.
Fig. 4.5. Percentage of iterations to decode undecodable blocks such that the hard decisions stay the same.
Fig. 4.6. Performance of BP decoding for the IEEE 802.16e (1152, 576) code with the proposed early termination mechanism A: (a) average number of required decoding iterations and (b) block error rate.
Fig. 4.7. Performance of BP decoding for the IEEE 802.16e (1152, 576) code with the proposed early termination mechanism B.
Fig. 4.8. Performance of BP decoding for the IEEE 802.16e (1152, 576) code with the proposed early termination mechanism C: (a) average number of required decoding iterations and (b) block error rate.
Fig. 4.9. Performance of BP decoding for the IEEE 802.16e (1152, 576) code with the proposed early termination mechanisms C and D with I_max = 3.
Fig. 4.10. Performance of BP decoding for the PEG (1024, 506) code with the proposed early termination mechanism A, mechanism D, and the methods in [27] and [39]: (a) average number of required decoding iterations and (b) block error rate.
Fig. 4.11. Performance of BP decoding for the PEG (2048, 1018) code with the proposed early termination mechanism A, mechanism D, and the methods in [27] and [39]: (a) average number of required decoding iterations and (b) block error rate.
Fig. 4.12. Block diagram of proposed early termination mechanisms A and C.
Fig. 5.1. (j, k)-regular code matrix structure for (a) QC codes in [44][45][46][47][48] and (b) two-level H-QC codes.
Fig. 5.2. Submatrix construction of a two-level H-QC code with q_1 = 5 and q_2 = 3: (a) the initial empty submatrix; (b)–(e) the construction of the first two inner submatrices, where only the shaded regions are possible positions for nonzero elements; (f) the completed result.
Fig. 5.3. Partially-parallel decoder architecture for (j, k)-regular QC codes in Fig. 5.1(a).
Fig. 5.4. Partially-parallel decoder architecture for (j, k)-regular H-QC codes in Fig. 5.1(b).
Fig. 5.5. Performance of LDPC codes on the AWGN channel using the sum-product decoding algorithm with a maximum of 100 iterations. Rate-1/2, N = 10080, 20160, 40320 H-QC codes are compared with QC and random codes with the same rates and lengths.
Fig. 5.6. Performance of LDPC codes on the AWGN channel using the sum-product decoding algorithm with a maximum of 40 iterations. Rate-1/2, N = 12288 H-QC codes with different numbers of inner submatrices are compared.

**List of Tables**

Table 2.1. Comparison between serial and parallel architectures.
Table 2.2. CPCs of the proposed two architectures.
Table 2.3. Area reduction (%) of the proposed serial architecture over the parallel architecture.
Table 2.4. Synthesis results of the proposed multi-rate architecture for the IEEE 802.16e standard.
Table 2.5. Synthesis results on FPGA as compared with the results in [21] and [19].
Table 3.1. Message selection strategies.
Table 3.2. Complexity comparison of dynamic schedules.
Table 4.1. Complexity comparison.
Table 4.2. Implementation comparison.
Table 5.1. Decoder implementation results and comparison on the Altera Stratix EP2S130 [50] FPGA device.

**Chapter 1 **

**Introduction **

**1.1 Overview of LDPC Codes **

Error control coding (ECC) is a technique for reliable data transmission over noisy channels that appends redundancy to the data. As illustrated in Fig. 1.1, it has become an essential component in modern digital communication systems to reduce transmit power requirements. Among the various error correcting codes, low-density parity-check (LDPC) codes, first proposed in the early 1960s by Gallager [1], are a class of linear block codes with sparse parity check matrices. The discovery was largely ignored for almost two decades until Tanner revisited it in 1981 and provided a graphical representation of LDPC codes [55]. In the late 1990s, MacKay and Neal showed that LDPC codes can achieve excellent error correcting performance very close to the Shannon limit [2]. Since then, LDPC codes have attracted tremendous attention and have been considered for use in practical communication systems and other digital applications such as recording media. In spite of their good error performance, the significant computational burden and processing latency of encoding and decoding LDPC codes remain serious obstacles to their practical use.

[Figure: data source → source encoder → channel encoder → modulation → noisy channel → demodulation → channel decoder → source decoder → data sink; the channel encoder/decoder pair forms the ECC block.]

Fig. 1.1. Basic model of a digital communication system.

Various classes of LDPC codes are currently adopted by the latest industrial standards, such as wireless LAN (IEEE 802.11n) [3], wireless MAN (IEEE 802.16e, WiMax) [4], 10 Gigabit Ethernet (IEEE 802.3an) [5], wireless PAN (IEEE 802.15.3c, UWB) [6], Mobile Broadband Wireless Access (MBWA) (IEEE 802.20) [7], satellite TV (DVB-S2) [8], and digital TV in China (DTTB) [9]. All of these codes have structured parity check matrices consisting of square submatrices, which facilitates efficient hardware implementation. The codes defined in IEEE 802.11n, IEEE 802.16e, and IEEE 802.20 have a dual-diagonal matrix structure for fast encoding. The codes defined in IEEE 802.3an are constructed from Reed-Solomon (RS) codes [56]; RS-based LDPC codes have good minimum distances and error performance.

An (N, K) LDPC code with code length N and message length K is defined by an M×N sparse parity check matrix **H** such that any valid codeword **c** must satisfy

$$\mathbf{c}\mathbf{H}^{T} = \mathbf{0}. \qquad (1.1)$$

M, equal to N−K, denotes the number of parity bits. The N columns of **H** correspond to the N code bits of a codeword and the M rows of **H** specify the M parity check constraints the code bits must satisfy. A code is (j, k)-regular if each column of **H** contains exactly j nonzero elements and each row contains exactly k nonzero elements; otherwise it is an irregular code. Fig. 1.2 shows a (2, 4)-regular code with N = 8 and M = 4. In general, irregular codes with carefully chosen row and column weights perform better than regular codes [57]. Note that longer code lengths generally lead to better error correcting performance. The code lengths of LDPC codes employed by modern communication systems typically range from a few hundred to several thousand bits.

$$
\mathbf{H}=
\begin{bmatrix}
1 & 0 & 1 & 0 & 1 & 0 & 1 & 0\\
1 & 0 & 0 & 1 & 0 & 1 & 0 & 1\\
0 & 1 & 1 & 0 & 0 & 1 & 1 & 0\\
0 & 1 & 0 & 1 & 1 & 0 & 0 & 1
\end{bmatrix}
$$

Fig. 1.2. Parity check matrix of a (2, 4)-regular LDPC code.
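To make the parity check constraint (1.1) concrete, the following sketch evaluates it for the **H** of Fig. 1.2 with NumPy. The all-ones vector happens to be a valid codeword of this particular matrix, since every row has even weight; a single flipped bit yields a nonzero syndrome:

```python
import numpy as np

# Parity check matrix of the (2, 4)-regular code in Fig. 1.2.
H = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 1, 0, 0, 1]])

def syndrome(c, H):
    """Compute s = cH^T over GF(2); c is a valid codeword iff s is all zero."""
    return (H @ c) % 2

c = np.ones(8, dtype=int)   # every row of H has weight 4, so cH^T = 0 (mod 2)
print(syndrome(c, H))       # all-zero syndrome: valid codeword

c_err = c.copy()
c_err[0] ^= 1               # flip one code bit
print(syndrome(c_err, H))   # nonzero syndrome exposes the error
```

Each nonzero syndrome position points to a violated parity check row; this is exactly the stopping test used by the iterative decoders discussed in Section 1.3.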

**1.2 Encoding **

Encoding of a linear block code uniquely maps a K-bit source message **s** to an N-bit valid codeword **c**. The general encoding method is to find a K×N generator matrix **G** such that

$$\mathbf{G}\mathbf{H}^{T} = \mathbf{0}, \qquad (1.2)$$

and then **c** can be obtained by

$$\mathbf{c} = \mathbf{s}\mathbf{G}. \qquad (1.3)$$

Systematic encoding, in which the K information bits of **s** are a part of **c**, is desired because **s** can be directly extracted after decoding. For a systematic code with **c** = [**s**|**p**], where **p** is the M-bit parity block, the generator matrix must be

$$\mathbf{G} = [\mathbf{I}_{K}\,|\,\mathbf{P}], \qquad (1.4)$$

where **I**_K is a size-K identity matrix and **P** is the parity portion of **G**. While **H** of an LDPC code is very sparse, the corresponding **G** is generally dense. The encoding complexity of this straightforward method is quadratic in the code length N: a huge number of additions and multiplications (XOR and AND operations for binary inputs) are required when **G** is large. Practical encoding algorithms should avoid matrix operations on dense matrices while keeping the encoding systematic. This can be achieved by preprocessing **H** into a certain special structure before encoding [14] or by placing special structural constraints on **H** when designing the code [58]. The class of block-type codes proposed in [58] and its subclass, the dual-diagonal codes adopted by the IEEE standards [3][4][7], can be encoded in linear time by exploiting their matrix structure. The current research focus is on designing low-latency, hardware-efficient encoding algorithms for this kind of codes.
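As a concrete illustration of systematic encoding via (1.3)–(1.4), the sketch below uses the well-known (7, 4) Hamming code, whose **G** = [**I**_4|**P**] and **H** = [**P**^T|**I**_3] satisfy **GH**^T = **P** + **P** = **0** over GF(2). The Hamming code stands in for an LDPC code here only to keep the matrices small; the mechanics are identical:

```python
import numpy as np

# (7, 4) Hamming code: G = [I_K | P], H = [P^T | I_M], so G H^T = P + P = 0 (mod 2).
P = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
K, M = P.shape
G = np.hstack([np.eye(K, dtype=int), P])    # generator matrix, Eq. (1.4)
H = np.hstack([P.T, np.eye(M, dtype=int)])  # parity check matrix

assert not ((G @ H.T) % 2).any()            # Eq. (1.2): G H^T = 0 over GF(2)

def encode(s):
    """Systematic encoding, Eq. (1.3): c = sG over GF(2)."""
    return (s @ G) % 2

s = np.array([1, 0, 1, 1])
c = encode(s)
print(c)                        # first K bits are the message itself
assert (c[:K] == s).all()       # systematic: s directly extractable
assert not ((H @ c) % 2).any()  # c satisfies all parity checks, Eq. (1.1)
```

For a dense **P** of a long LDPC code, the `s @ G` product above is exactly the quadratic-cost operation that the structured encoding methods of Chapter 2 avoid.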

**1.3 Decoding **

LDPC codes are usually decoded by a soft-decision decoding algorithm based on iterative belief propagation (BP). **H** can be represented by a bipartite graph known as a Tanner graph [55]. As shown in Fig. 1.3, the Tanner graph has two disjoint sets of nodes: N bit nodes (BNs, also known as variable nodes) and M check nodes (CNs), which correspond to the N columns and M rows of **H** respectively. A BN v_i is connected to a CN c_j by an edge if and only if the entry (j, i) of **H** is nonzero, i.e., the code bit represented by v_i is contained in the parity check equation represented by c_j. Because the Tanner graph displays the incidence relationship between the code bits and the parity check equations that check on them, it can be used to study the iterative BP decoding algorithm.

[Figure: a bipartite graph with check nodes c_0–c_3 on one side and bit nodes v_0–v_7 on the other.]

Fig. 1.3. Tanner graph corresponding to the parity check matrix in Fig. 1.2.

The decoding algorithm iteratively processes the received soft values to improve the reliability of each decoded bit based on the Tanner graph. The reliability measures computed at the end of each iteration are used as the input of the next iteration, and the hard decisions in each iteration are made from these measures. The decoding process continues until a valid codeword is found or another stopping criterion is satisfied.

The standard iterative decoding algorithm is the sum-product algorithm (SPA). We summarize the log-domain SPA as follows [10]. L(c_i) and L(Q_i) are the log-likelihood ratios (LLRs) of the i-th bit of the received and corrected codeword respectively. L(q_ij) is the LLR message from BN v_i to CN c_j, and L(r_ji) is the LLR message from CN c_j to BN v_i.

**Log-domain Sum-Product Algorithm**

**Step 1. [Initialization]**

$$
L(q_{ij}) = L(c_i) =
\begin{cases}
+\infty, & y_i = 0 \\
-\infty, & y_i = 1 \\
0, & y_i = E
\end{cases}
\;\text{(BEC)};\qquad
L(q_{ij}) = L(c_i) = (-1)^{y_i}\log\frac{1-\varepsilon}{\varepsilon}\;\text{(BSC)};\qquad
L(q_{ij}) = L(c_i) = 2y_i/\sigma^2\;\text{(BI-AWGNC)}
\qquad (1.5)
$$

for all i, j for which H_ij = 1, where y_i is the received signal.

**Step 2. [CN Operation]**

$$
L(r_{ji}) = \left(\prod_{i'\in V_j\setminus i}\alpha_{i'j}\right)\cdot\phi\!\left(\sum_{i'\in V_j\setminus i}\phi(\beta_{i'j})\right), \qquad (1.6)
$$

where $\alpha_{ij} = \operatorname{sign}\big(L(q_{ij})\big)$, $\beta_{ij} = \big|L(q_{ij})\big|$, and

$$
\phi(x) = -\log\tanh(x/2) = \log\frac{e^{x}+1}{e^{x}-1}. \qquad (1.7)
$$

V_j is the set of the BNs connected to the CN c_j.

**Step 3. [BN Operation]**

$$
L(q_{ij}) = L(c_i) + \sum_{j'\in C_i\setminus j} L(r_{j'i}). \qquad (1.8)
$$

$$
L(Q_i) = L(c_i) + \sum_{j\in C_i} L(r_{ji}). \qquad (1.9)
$$

C_i is the set of the CNs connected to the BN v_i.

**Step 4. [Hard Decision Making and Stopping Condition Testing]**

$$
\hat{c}_i =
\begin{cases}
1, & L(Q_i) < 0 \\
0, & \text{otherwise}
\end{cases}
\quad \text{for } i = 0, 1, \ldots, N-1.
$$

If $\hat{\mathbf{c}}\mathbf{H}^{T} = \mathbf{0}$ or the number of iterations equals the maximum limit, stop; otherwise go to Step 2.
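The four steps above can be sketched directly in code. The following is a minimal, unoptimized log-domain SPA decoder for the small (2, 4)-regular **H** of Fig. 1.2 on the BI-AWGN channel. Variable names mirror the equations; the channel LLRs are chosen by hand (one unreliable bit among strongly reliable ones) rather than drawn from a simulated channel:

```python
import numpy as np

H = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 1, 0, 0, 1]])

def phi(x):
    # phi(x) = -log(tanh(x/2)) of Eq. (1.7); clip to avoid log(0) at the extremes
    x = np.clip(x, 1e-9, 30.0)
    return -np.log(np.tanh(x / 2.0))

def spa_decode(Lc, H, max_iter=50):
    M, N = H.shape
    Lq = H * Lc                       # Step 1: L(q_ij) initialized to channel LLRs
    for _ in range(max_iter):
        # Step 2: CN operation, Eq. (1.6), with alpha = sign and beta = |L(q)|
        Lr = np.zeros((M, N))
        for j in range(M):
            idx = np.nonzero(H[j])[0]
            alpha, beta = np.sign(Lq[j, idx]), np.abs(Lq[j, idx])
            for t, i in enumerate(idx):
                others = np.delete(np.arange(len(idx)), t)
                Lr[j, i] = np.prod(alpha[others]) * phi(np.sum(phi(beta[others])))
        # Step 3: BN operation, Eqs. (1.8)-(1.9); L(q_ij) = L(Q_i) - L(r_ji)
        LQ = Lc + Lr.sum(axis=0)
        Lq = H * (LQ - Lr)
        # Step 4: hard decision and stopping test
        c_hat = (LQ < 0).astype(int)
        if not ((H @ c_hat) % 2).any():
            return c_hat, True        # valid codeword found
    return c_hat, False               # decoding failure

# Channel LLRs favouring the all-ones codeword, with bit 0 received unreliably
Lc = np.array([+1.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0])
c_hat, ok = spa_decode(Lc, H)
print(c_hat, ok)   # the weak bit is corrected and the syndrome check passes
```

The per-check inner loops make the message-passing cost explicit: every iteration touches every edge of the Tanner graph, which is why the iteration-reduction and scheduling techniques of Chapters 3 and 4 pay off.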

The decoding process requires a number of iterations to converge to a valid codeword or declare a decoding failure. In every iteration, a huge number of messages are passed between BNs and CNs, with complex operations performed in each node. The iterative decoder therefore consumes considerable power and needs a long latency to decode one codeword. Reducing both the number of iterations per codeword and the operations per iteration can minimize the decoder energy consumption and processing time. In addition, the decoding schedule, i.e., the processing order of BNs and CNs, can greatly affect the convergence speed of the decoder. How to alter the standard decoding schedule presented above to achieve better convergence speed without error performance degradation is another challenge.

An important issue regarding the performance of LDPC codes is the error floor phenomenon. For conventional error correcting codes such as Reed-Solomon and convolutional codes, the error performance curve decreases continuously as the signal-to-noise ratio (SNR) increases. For LDPC codes with iterative decoding, however, once the SNR exceeds a certain value the performance curve no longer decreases as quickly as it does at lower SNRs. This segment of the curve is the error floor, and the corresponding SNR region is referred to as the error floor region. The error floor is undesirable for practical communication systems and should be removed or lowered. Error floor performance has been found to be dominated by certain graph structures such as trapping sets [13]. With this knowledge, enhanced decoding approaches, such as better decoding scheduling strategies, can be designed to break these bad structures and thus lower the error floor.

**1.4 Code Structure **

The error correcting performance of a code is directly related to the structure of **H** (or its Tanner graph). The structure of Tanner graphs has been extensively analyzed in the past to find codes with good performance, and graph conditioning techniques have been developed to prevent the structures that limit decoding performance. Among these graph structures, cycles are relatively easy to analyze and control. A length-l cycle in a Tanner graph is a closed path composed of l edges, as shown in Fig. 1.4. Short cycles prevent the decoder from converging to optimum (maximum likelihood) decoding. The girth of a code is the length of the shortest cycle in its Tanner graph. Some code construction algorithms are designed to construct **H** with a larger girth and better cycle structure [11][12]. To lower the error floor, one can also try to construct **H** with better trapping-set properties. However, these structures are more complex and make code construction more difficult.

[Figure: the **H** of Fig. 1.2 and its Tanner graph, with one length-4 cycle and one length-6 cycle highlighted.]

**Fig. 1.4. Length-4 and length-6 cycles in H and its Tanner graph.**
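Short cycles can also be detected mechanically: a length-4 cycle exists exactly when two columns of **H** share nonzero entries in two or more rows, i.e., when an off-diagonal entry of **H**^T**H** is at least 2. A sketch for the **H** of Fig. 1.2:

```python
import numpy as np
from itertools import combinations

H = np.array([[1, 0, 1, 0, 1, 0, 1, 0],
              [1, 0, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1, 0],
              [0, 1, 0, 1, 1, 0, 0, 1]])

def four_cycles(H):
    """Pairs of bit nodes on a length-4 cycle: columns overlapping in >= 2 rows."""
    overlap = H.T @ H   # overlap[i, i'] = number of CNs shared by v_i and v_i'
    return [(i, k) for i, k in combinations(range(H.shape[1]), 2)
            if overlap[i, k] >= 2]

print(four_cycles(H))   # bit-node pairs (v_2, v_6) and (v_3, v_7) each close a 4-cycle
```

Graph conditioning during code construction amounts to rejecting candidate column placements that would create such overlaps, which is the idea behind girth-oriented algorithms such as progressive edge-growth.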

In addition, the structure of **H** affects the implementation complexity and throughput of the corresponding encoder and decoder. In particular, for an **H** without any structural regularity, the decoder requires high hardware cost for message routing and memory access. Some encoder/decoder-oriented codes have therefore been proposed in the past: with quasi-cyclic and dual-diagonal matrix structures, low-cost partially-parallel decoding and linear-time encoding can be realized respectively. However, how to jointly optimize both error correcting performance and encoder/decoder efficiency remains a challenging problem when designing the code structure.

**1.5 Contributions and Organizations of This Thesis **

From the above discussions, we know that designing an LDPC coding system requires addressing issues in three areas: encoding, decoding, and code structure. In this thesis, we propose several approaches in these three areas to reduce the power consumption and processing time of LDPC coding systems while maintaining good error correcting performance. In Chapter 2, we propose a low-latency encoding algorithm for dual-diagonal LDPC codes, which are widely adopted by many next generation communication standards. For LDPC decoders, which are usually more complex and consume more power than encoders, we first propose techniques to reduce the overall decoding operations in Chapter 3; intelligent scheduling strategies are then presented to improve the convergence speed and error performance. In Chapter 4, low-complexity early detection of successful decoding (for dual-diagonal codes) and early termination of unsuccessful decoding are proposed to save decoding iterations. Finally, in Chapter 5, we design a new class of long-length structured LDPC codes considering both error performance and decoder implementation complexity. Chapter 6 concludes this thesis and presents possible future work.

**Chapter 2 **

**Low-Latency Encoding Algorithm for Dual-Diagonal Codes Based on Two-Way Parity Bit Correction**

**2.1 Motivation **

One drawback of LDPC codes is their high encoding complexity resulting from dense generator matrices. The use of generator matrices in the encoding process can be avoided by employing a dual-diagonal matrix structure. LDPC codes with dual-diagonal structure are adopted by the IEEE 802.11n [3], IEEE 802.16e [4], and IEEE 802.20 [7] standards. This class of codes can be encoded in near-linear time by Richardson and Urbanke's (RU) method [14] and in linear time by the sequential method in [15]. RU-based encoder designs [16][17][18][19] reduce the encoding complexity by multiplying with mostly sparse matrices and relatively small dense matrices. On the other hand, sequential-algorithm-based designs [20][21] involve only operations with sparse matrices. The encoders in [18][19][20][21] are customized for IEEE 802.11n or IEEE 802.16e LDPC codes. The works in [21] and [19] achieve the highest encoding throughput but also require a long encoding latency. Moreover, due to the data dependency of the sequential encoding algorithm, the hardware resources cannot be shared efficiently among codes with different code rates and code lengths. Note that the encoding throughput can be increased simply by interleaving multiple encoder instances [16][19]; thus the most important metric for evaluating encoder efficiency is the throughput/area ratio. The dual-diagonal matrix structure is also exploited in the arbitrary bit generation and correction encoding algorithm [22][23]. This approach achieves low encoding complexity and reduces the encoding latency. However, it places a special limit on the matrix structure that is incompatible with the IEEE 802.11n and IEEE 802.16e standards. From the results in [22], the matrix modification degrades error correcting performance, with higher error floors compared to the original IEEE 802.11n codes.

In this chapter, we propose a generalized two-way prediction and correction based encoding scheme. The proposed scheme places no limitation on the dual-diagonal matrix structure, so our algorithm can be directly applied to encode IEEE 802.11n and IEEE 802.16e LDPC codes. The encoding latency is lowered thanks to the reduced data dependency in our algorithm. Both serial and parallel architectures are implemented on FPGA to demonstrate the improvement in throughput and throughput/area ratio. A multi-rate IEEE 802.16e encoder is also implemented to show the efficiency of hardware sharing. The remainder of this chapter is organized as follows. Section 2.2 introduces dual-diagonal LDPC codes. Section 2.3 presents the proposed encoding algorithm and compares it with other methods. The encoder architecture for the proposed algorithm is described and analyzed in Section 2.4. Section 2.5 shows the hardware implementation results and comparisons with related works. Finally, Section 2.6 summarizes this chapter.

**2.2 Dual-Diagonal LDPC Codes **

The dual-diagonal parity check matrix **H** of size M×N in IEEE 802.11n and 802.16e is defined as

$$
\mathbf{H} = \left[(\mathbf{H}_s)_{M\times K}\mid(\mathbf{H}_p)_{M\times M}\right] =
\begin{bmatrix}
\mathrm{P}_{0,0} & \mathrm{P}_{0,1} & \mathrm{P}_{0,2} & \cdots & \mathrm{P}_{0,n_b-1}\\
\mathrm{P}_{1,0} & \mathrm{P}_{1,1} & \mathrm{P}_{1,2} & \cdots & \mathrm{P}_{1,n_b-1}\\
\mathrm{P}_{2,0} & \mathrm{P}_{2,1} & \mathrm{P}_{2,2} & \cdots & \mathrm{P}_{2,n_b-1}\\
\vdots & & & & \vdots\\
\mathrm{P}_{m_b-1,0} & \mathrm{P}_{m_b-1,1} & \mathrm{P}_{m_b-1,2} & \cdots & \mathrm{P}_{m_b-1,n_b-1}
\end{bmatrix} \qquad (2.1)
$$

where **H**_s corresponds to the information bits and **H**_p corresponds to the parity bits. P_{i,j} is either a circulant permutation matrix or a zero matrix of size z. A circulant permutation matrix is formed by circularly shifting the rows of an identity matrix of size z to the right by a certain number of positions. **H** is expanded from a base matrix **H**_b,

$$
\mathbf{H}_b = \left[(\mathbf{H}_{bs})_{m_b\times k_b}\mid(\mathbf{H}_{bp})_{m_b\times m_b}\right], \qquad (2.2)
$$

where m_b = M/z and k_b = K/z. Each element in **H**_b is either a nonnegative number representing the shift quantity of the corresponding permutation matrix, or −1 representing a zero matrix. The structure of **H**_bp is further defined as

$$
\mathbf{H}_{bp} = \left[(\mathbf{t})_{m_b\times 1}\mid(\mathbf{h})_{m_b\times(m_b-1)}\right] =
\begin{bmatrix}
d & 0 & & & & \\
  & 0 & 0 & & & \\
  & & 0 & \ddots & & \\
0 & & & \ddots & 0 & \\
  & & & & 0 & 0 \\
d & & & & & 0
\end{bmatrix} \qquad (2.3)
$$

where **h** represents the dual-diagonal portion and the weight-3 column **t** has d at its top and bottom rows and 0 in one middle row. Note that d is a positive number and all blank entries are zero matrices. By exploiting this structure, the codeword can be encoded recursively in linear time and the encoder complexity can be reduced significantly.

However, the data dependency in the process increases the number of clock cycles needed to encode a codeword.

The LDPC code lengths (N) in IEEE 802.11n are 648, 1296, and 1944 with sub-block sizes z = 27, 54, and 81 respectively. The code lengths in IEEE 802.16e range from 576 to 2304 with z = 24 to 96. Both standards support code rates of 1/2, 2/3, 3/4, and 5/6. Fig. 2.1 shows a sample code matrix in IEEE 802.16e with rate 1/2.

- 94 73 - - - - - 55 83 - - 7 0 - - - - - - - - - -
- 27 - - - 22 79 9 - - - 12 - 0 0 - - - - - - - - -
- - - 24 22 81 - 33 - - - 0 - - 0 0 - - - - - - - -
61 - 47 - - - - - 65 25 - - - - - 0 0 - - - - - - -
- - 39 - - - 84 - - 41 72 - - - - - 0 0 - - - - - -
- - - - 46 40 - 82 - - - 79 0 - - - - 0 0 - - - - -
- - 95 53 - - - - - 14 18 - - - - - - - 0 0 - - - -
- 11 73 - - - 2 - - 47 - - - - - - - - - 0 0 - - -
12 - - - 83 24 - 43 - - - 51 - - - - - - - - 0 0 - -
- - - - - 94 - 59 - - 70 72 - - - - - - - - - 0 0 -
- - 7 65 - - - - 39 49 - - - - - - - - - - - - 0 0
43 - - - - 66 - 41 - - - 26 7 - - - - - - - - - - 0

*Fig. 2.1. Base matrix of the rate-1/2, N = 2304 IEEE 802.16e LDPC code. *
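The expansion of a base matrix into the full parity-check matrix described by (2.2) can be made concrete with a short sketch. The following code is our own illustration (not part of the thesis); it assumes the base matrix is given as a list of rows with $-1$ marking zero submatrices, as in Fig. 2.1:

```python
import numpy as np

def expand_base_matrix(Hb, z):
    """Expand a base matrix into a full parity-check matrix.

    Each nonnegative entry s becomes a z-by-z identity matrix whose
    rows are circularly shifted s positions to the right; each -1
    entry becomes a z-by-z zero matrix.
    """
    mb, nb = len(Hb), len(Hb[0])
    H = np.zeros((mb * z, nb * z), dtype=np.uint8)
    I = np.eye(z, dtype=np.uint8)
    for i, row in enumerate(Hb):
        for j, s in enumerate(row):
            if s >= 0:
                # np.roll along axis=1 moves each row's 1 right by s
                H[i*z:(i+1)*z, j*z:(j+1)*z] = np.roll(I, s, axis=1)
    return H
```

For the IEEE 802.16e rate-1/2 code of Fig. 2.1, `expand_base_matrix` with $z = 96$ would yield the full $1152 \times 2304$ matrix $\mathbf{H}$.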

**2.3 Proposed Encoding Procedure **

We denote the information block $\mathbf{s} = [a_0\ a_1\ \cdots\ a_{K-1}]$ and the information sub-blocks $\mathbf{s}_i = [a_{iz}\ a_{iz+1}\ \cdots\ a_{(i+1)z-1}]$ for $i = 0, 1, \ldots, k_b-1$. Also, let $\mathbf{p} = [b_0\ b_1\ \cdots\ b_{M-1}]$ denote the parity block and $\mathbf{p}_i = [b_{iz}\ b_{iz+1}\ \cdots\ b_{(i+1)z-1}]$ for $i = 0, 1, \ldots, m_b-1$ denote the parity sub-blocks. The prediction vector $\mathbf{p}_i' = [b'_{iz}\ b'_{iz+1}\ \cdots\ b'_{(i+1)z-1}]$ is defined as the predicted solution of $\mathbf{p}_i$. The row index of the nonnegative entry in the middle of the weight-3 column of $\mathbf{H}_{bp}$ is denoted by $x$. Note that all operations discussed in the following are modulo-2 operations.
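As a concrete reading of this sub-block notation, a hypothetical helper (our own, not from the thesis) that splits a block into its $z$-bit sub-blocks might look like:

```python
def sub_blocks(block, z):
    """Split a bit vector into consecutive z-bit sub-blocks.

    Applied to the information block s this yields s_0 ... s_{kb-1};
    applied to the parity block p it yields p_0 ... p_{mb-1}.
    """
    assert len(block) % z == 0
    return [block[i*z:(i+1)*z] for i in range(len(block) // z)]
```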

**2.3.1 ** **Encoding Concept **

By definition, a valid codeword $\mathbf{c} = [\mathbf{s} \mid \mathbf{p}]$ must satisfy the following equation

$$
\mathbf{H}\mathbf{c}^t = \left[ (\mathbf{H}_s)_{M\times K} \mid (\mathbf{H}_p)_{M\times M} \right] \left[ \mathbf{s} \mid \mathbf{p} \right]^t = \mathbf{0}.
\tag{2.4}
$$

**Replacing H**

**s**

**and H**

**p**by the dual-diagonal matrix definition in section 2, we get

t t

0,0 0,1 0,2 0, 1 0 0

t t

1,0 1,1 1,2 1, 1 1 1

t t

2,0 2,1 2,2 2, 1 2 2

t

1,0 1,1 1,2 1, 1 1 1

P P P P s I I p

P P P P s I I p

P P P P s I p

I

I I I

I I

P P P P s p

*b*
*b*
*b*

*b* *b* *b* *b* *b* *b* *b*

*k* *d*

*k*
*k*

*m* *m* *m* *m* *k* *k* *d* *m*

−

−

−

− − − − − − −

+

^{t}

0

=

(2.5)

where $I$ is the identity matrix and $I_d$ is the circulant permutation matrix with its rows shifted $d$ positions to the right. After the matrix multiplication, we obtain

$$
\begin{bmatrix}
(\mathbf{p}_0^t)_d + \mathbf{p}_1^t + \sum_{j=0}^{k_b-1} P_{0,j}\,\mathbf{s}_j^t \\
\mathbf{p}_1^t + \mathbf{p}_2^t + \sum_{j=0}^{k_b-1} P_{1,j}\,\mathbf{s}_j^t \\
\vdots \\
\mathbf{p}_0^t + \mathbf{p}_x^t + \mathbf{p}_{x+1}^t + \sum_{j=0}^{k_b-1} P_{x,j}\,\mathbf{s}_j^t \\
\vdots \\
(\mathbf{p}_0^t)_d + \mathbf{p}_{m_b-1}^t + \sum_{j=0}^{k_b-1} P_{m_b-1,j}\,\mathbf{s}_j^t
\end{bmatrix}
= \mathbf{0}
\tag{2.6}
$$

where $(\mathbf{p}_0)_d$ is $\mathbf{p}_0$ circularly shifted $d$ positions to the left. Summing all rows in (2.6), we get

$$
\mathbf{p}_0 = \sum_{i=0}^{m_b-1} \lambda_i
\tag{2.7}
$$

where $\lambda_i = \sum_{j=0}^{k_b-1} P_{i,j}\,\mathbf{s}_j^t$. For sequential encoding, $\mathbf{p}_0$ must be calculated first by a series of shifting and accumulation operations in (2.7). Then $\mathbf{p}_1$ to $\mathbf{p}_{m_b-1}$ can be obtained by forward or backward substitution through the equations in (2.6). Due to the data dependency among the parity sub-blocks, the calculation of $\mathbf{p}_0$ and the derivation of $\mathbf{p}_1$ to $\mathbf{p}_{m_b-1}$ cannot be parallelized to save time.
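This sequential baseline can be sketched in a few lines of code. The following is our own illustration (not from the thesis), assuming the $\lambda_i$ are available as $z$-bit numpy vectors and that $(\mathbf{p}_0)_d$ denotes a $d$-position left circular shift as stated above:

```python
import numpy as np

def sequential_parity(lam, d, x):
    """Sequential encoding through (2.6)-(2.7).

    lam: list of mb z-bit vectors, lam[i] = lambda_i.
    Returns the parity sub-blocks p_0 ... p_{mb-1}.
    """
    mb = len(lam)
    # (2.7): p_0 is the modulo-2 sum of all lambda_i
    p0 = np.bitwise_xor.reduce(lam)
    p = [p0]
    # row 0 of (2.6): p_1 = lambda_0 + (p_0)_d (left shift by d)
    p.append((lam[0] + np.roll(p0, -d)) % 2)
    # forward substitution through the remaining rows of (2.6)
    for i in range(1, mb - 1):
        nxt = (lam[i] + p[i]) % 2
        if i == x:              # row x of (2.6) also contains p_0
            nxt = (nxt + p0) % 2
        p.append(nxt)
    return p
```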

In our proposed approach, instead of calculating $\mathbf{p}_0$ first, we jump-start the calculation by setting $\mathbf{p}_0$ to an arbitrary vector $\mathbf{p}_0'$ and immediately calculate the prediction vectors $\mathbf{p}_i'$.

$\mathbf{p}_1'$ and $\mathbf{p}_{m_b-1}'$ can be obtained by $\mathbf{p}_1' = \lambda_0 + (\mathbf{p}_0')_d$ and $\mathbf{p}_{m_b-1}' = \lambda_{m_b-1} + (\mathbf{p}_0')_d$, respectively. Then the other prediction vectors $\mathbf{p}_i'$ are obtained by forward substitution

$$
\mathbf{p}_i' = \lambda_{i-1} + \mathbf{p}_{i-1}'
\tag{2.8}
$$

for $i = 2, 3, \ldots, x$, and backward substitution

$$
\mathbf{p}_i' = \lambda_i + \mathbf{p}_{i+1}'
\tag{2.9}
$$

for $i = m_b-2, m_b-3, \ldots, x+1$. After the backward substitution, an additional operation $\mathbf{p}_{temp} = \lambda_x + \mathbf{p}_{x+1}'$ is needed, and $\mathbf{p}_{temp}$ will be used in the following computation. It is known that the relationship $(\mathbf{p}_0)_d + (\mathbf{p}_0')_d = \mathbf{p}_i + \mathbf{p}_i'$ holds for $i = 1, 2, \ldots, m_b-1$. We define the correction vector $\mathbf{f}$ as

$$
\mathbf{f} = (\mathbf{p}_0)_d + (\mathbf{p}_0')_d = (\mathbf{p}_0 + \mathbf{p}_0')_d.
\tag{2.10}
$$

$\mathbf{f}$ is unknown at this point because $\mathbf{p}_0$ is not yet available. However, it is known that

$$
\mathbf{p}_x + \mathbf{p}_{x+1} = (\mathbf{p}_x' + \mathbf{f}) + (\mathbf{p}_{x+1}' + \mathbf{f}) = \mathbf{p}_x' + \mathbf{p}_{x+1}'.
\tag{2.11}
$$

Hence, from the $x$-th equation in (2.6), we can calculate $\mathbf{p}_0$ by

$$
\mathbf{p}_0 = \lambda_x + \mathbf{p}_x + \mathbf{p}_{x+1} = \lambda_x + \mathbf{p}_x' + \mathbf{p}_{x+1}' = \mathbf{p}_x' + \mathbf{p}_{temp}.
\tag{2.12}
$$

Then $\mathbf{f}$ can be obtained from (2.10). If we set $\mathbf{p}_0'$ to a zero vector at first, then $\mathbf{f}$ is simply $(\mathbf{p}_0)_d$, the shifted version of $\mathbf{p}_0$. Finally, the other parity sub-blocks can be obtained by

$$
\mathbf{p}_i = \mathbf{p}_i' + \mathbf{f}.
\tag{2.13}
$$

_{i}**2.3.2 ** **Proposed Encoding Scheme **

Based on the previous discussion, our proposed encoding algorithm is summarized as follows.

**Proposed Scheme **

Step 1. Set $\mathbf{p}_0'$ (i.e., $b'_0, b'_1, \ldots, b'_{z-1}$) to any binary vector.

Step 2. Compute the vector $\lambda = \mathbf{H}_s\mathbf{s}$ by circularly shifting and accumulating the sub-blocks of $\mathbf{s}$. (We denote $\lambda = [c_0\ c_1\ \cdots\ c_{M-1}]$ and $\lambda_i = [c_{iz}\ c_{iz+1}\ \cdots\ c_{(i+1)z-1}]$ for $i = 0, 1, \ldots, m_b-1$.)

Step 3. [Forward Derivation] Compute $\mathbf{p}_1', \mathbf{p}_2', \ldots, \mathbf{p}_x'$.

Step 4. [Backward Derivation] Compute $\mathbf{p}_{m_b-1}', \mathbf{p}_{m_b-2}', \ldots, \mathbf{p}_{x+1}'$ and $\mathbf{p}_{temp}$.

Step 5. Compute $\mathbf{p}_0$ by adding $\mathbf{p}_x'$ and $\mathbf{p}_{temp}$.

Step 6. Compute the correction vector $\mathbf{f}$ by circularly shifting the sum of $\mathbf{p}_0$ and $\mathbf{p}_0'$ to the left by $d$ positions.

Step 7. [Correction] If $\mathbf{f}$ is a nonzero vector, compute $\mathbf{p}_i$ by adding $\mathbf{p}_i'$ and $\mathbf{f}$ for $i = 1$ to $m_b-1$. Otherwise, $\mathbf{p}_i$ is simply $\mathbf{p}_i'$.
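The seven steps above can be sketched in software. The following is our own illustration (not part of the thesis), assuming the $\lambda_i$ from Step 2 are available as $z$-bit numpy vectors, that $(\cdot)_d$ denotes a $d$-position left circular shift, and that $\mathbf{p}_0'$ is chosen as the zero vector:

```python
import numpy as np

def two_way_encode(lam, d, x):
    """Two-way prediction encoding (Steps 1-7), a software sketch.

    lam: list of mb z-bit vectors, lam[i] = lambda_i from Step 2.
    Returns the parity sub-blocks p_0 ... p_{mb-1}.
    """
    mb, z = len(lam), len(lam[0])
    p0_pred = np.zeros(z, dtype=int)               # Step 1: p_0' = 0
    pred = [None] * mb                             # prediction vectors p_i'
    pred[0] = p0_pred
    # Step 3: forward derivation p_1' ... p_x'
    pred[1] = (lam[0] + np.roll(p0_pred, -d)) % 2
    for i in range(2, x + 1):
        pred[i] = (lam[i - 1] + pred[i - 1]) % 2   # (2.8)
    # Step 4: backward derivation p_{mb-1}' ... p_{x+1}' and p_temp
    pred[mb - 1] = (lam[mb - 1] + np.roll(p0_pred, -d)) % 2
    for i in range(mb - 2, x, -1):
        pred[i] = (lam[i] + pred[i + 1]) % 2       # (2.9)
    p_temp = (lam[x] + pred[x + 1]) % 2
    # Step 5: p_0 = p_x' + p_temp, per (2.12)
    p0 = (pred[x] + p_temp) % 2
    # Step 6: f = (p_0 + p_0')_d, per (2.10)
    f = np.roll((p0 + p0_pred) % 2, -d)
    # Step 7: correct the predictions, per (2.13)
    return [p0] + [(pred[i] + f) % 2 for i in range(1, mb)]
```

The two loops (Steps 3 and 4) have no mutual dependency, which is precisely what the hardware exploits by running the forward and backward derivations concurrently.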

Compared with the sequential algorithm, the proposed algorithm reduces encoding latency in the following ways. In Step 1, $\mathbf{p}_0'$ is set arbitrarily instead of being computed by the matrix operations in (2.7). Then, in Steps 3 and 4, the prediction vectors $\mathbf{p}_1', \mathbf{p}_2', \ldots, \mathbf{p}_{m_b-1}'$ can be obtained without knowing $\mathbf{p}_0$. Steps 5 and 6 compute $\mathbf{p}_0$ and $\mathbf{f}$ from (2.12) and (2.10), respectively. In addition, there is no dependency between Step 3 and Step 4, so the forward and backward derivations can be executed simultaneously. Since the algorithm proposed in [22] can only generate these bits by forward substitution, our approach reduces the encoding delay further.

**2.3.3 ** **Encoding Example **

An encoding example is illustrated in Fig. 2.2. The submatrix size $z$ in this example is 4. At first, $b'_0, b'_1, b'_2, b'_3$ are set to zeros (Step 1). The value $\lambda = \mathbf{H}_s\mathbf{s}$ is calculated in Step 2. With these four bits, we can obtain $b'_4, b'_5, \ldots, b'_{15}$ (Step 3) and $b'_{23}, b'_{22}, \ldots, b'_{16}$ (Step 4) by forward and backward derivation, respectively. $\mathbf{p}_{temp} = [d_0\ d_1\ d_2\ d_3]$ can be calculated using $b'_{16}, b'_{17}, b'_{18}, b'_{19}$ and $c_{12}, c_{13}, c_{14}, c_{15}$. After that, we can easily find $\mathbf{p}_0$ by adding $\mathbf{p}_3'$ and $\mathbf{p}_{temp}$ (Step 5). Then the correction vector $\mathbf{f}$ is the sum of $\mathbf{p}_0$ and $\mathbf{p}_0'$ with a one-position left circular shift (Step 6). At last, $\mathbf{f}$ is added to $\mathbf{p}_1', \mathbf{p}_2', \mathbf{p}_3', \mathbf{p}_4', \mathbf{p}_5'$ to generate the other parity bits $b_4, b_5, \ldots, b_{23}$ (Step 7). The final solution obtained is $\mathbf{p} = [1\ 1\ 0\ 1\ 1\ 1\ 0\ 1\ 1\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 0\ 1\ 1\ 0\ 1\ 1]$.

**2.4 Proposed Encoder Architecture **

In this section, we present a parallel and a serial architecture based on the proposed encoding algorithm. The hardware complexity and throughput of both architectures are also compared.

**2.4.1 ** **Parallel Architecture **

Fig. 2.3 shows the block diagram of the proposed parallel architecture. Our architecture takes advantage of the two-way prediction to achieve a higher level of parallelism. The encoder architecture is composed of the following four stages. Note that the initial prediction vector $\mathbf{p}_0'$ is set to zero to simplify the hardware.

Fig. 2.3. Proposed parallel encoder architecture.

The first stage carries out the multiplication of the submatrices in $\mathbf{H}_s$ and the information sub-blocks (i.e., $\mathbf{s}_i$) by multiplexers and barrel shifters (part of Step 2 in the proposed scheme). As shown in Fig. 2.4, the multiplication proceeds in a two-way fashion from the topmost and bottommost block rows of $\mathbf{H}_s$ simultaneously to reduce the encoding latency. A conventional sequential encoder architecture such as [15] uses $k_b$ barrel shifters for parallel computation of $\mathbf{H}_s\mathbf{s}$; $k_b$ is 12 for the IEEE 802.16e (2304, 1152) code. Nevertheless, in every computation cycle, certain barrel shifters are idle due to the zero submatrices in $\mathbf{H}_s$. To minimize the number of idle barrel shifters, our proposed architecture uses $2\times\alpha$ barrel shifters to process two block rows simultaneously, where $\alpha$ is the maximum number of nonzero submatrices in one block row of $\mathbf{H}_s$. Multiplexers are used to select the information sub-blocks corresponding to nonzero submatrices. In later stages, the forward derivation corresponding to the upper part of $\mathbf{H}_s$ and the backward derivation corresponding to the lower part of $\mathbf{H}_s$ can also be done in parallel. For the 802.16e (2304, 1152) code, only 10 barrel shifters are needed to achieve the two-way computation.

[Figure: the base matrix of Fig. 2.1, annotated with the two-way (top-down and bottom-up) block-row processing order.]

Fig. 2.4. Encoding order for the proposed parallel encoder architecture.

In the second stage, $2\times\alpha$ $z$-bit XOR gates are used for the two-way computation of the $\lambda_i$'s (part of Step 2) and the prediction vectors $\mathbf{p}_i'$ (Steps 3 and 4). The $\mathbf{p}_i'$'s are computed as soon as the corresponding $\lambda_i$ is available. The computations of the $\lambda_i$'s and $\mathbf{p}_i'$'s are pipelined to reduce the latency. In the meantime, $\mathbf{p}_0$ and the correction vector $\mathbf{f} = (\mathbf{p}_0)_d$ are each obtained by one XOR operation (Steps 5 and 6). In the last stage, the $\mathbf{p}_i'$'s are corrected by $\mathbf{f}$ via XOR gates (Step 7). There is no data dependency among these correction operations, so the $\mathbf{p}_i$'s can be computed in parallel.

**2.4.2 ** **Serial Architecture **

We also propose a serial encoder architecture for low throughput applications. Fig. 2.5 shows the block diagram of the proposed architecture. Encoding operations are scheduled to reduce idling hardware. The architecture is composed of the following three stages.

Fig. 2.5. Proposed serial encoder architecture.

The first stage employs two barrel shifters for the matrix multiplication, an $\alpha$-fold reduction compared with the parallel architecture. The information sub-blocks corresponding to the upper and lower parts of $\mathbf{H}_s$ are serially selected and processed by the two barrel shifters, respectively. Fig. 2.6 shows the processing order for the multiplication operations. All zero submatrices are skipped, so at most $\alpha$ cycles are required for each block row instead of $k_b$ cycles.
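The cycle saving from skipping zero submatrices can be illustrated with a small scheduling sketch (our own, with hypothetical helper names):

```python
def block_row_schedule(Hb_s):
    """List, per block row, the (column, shift) pairs a barrel shifter
    must process. Zero submatrices (-1) are skipped entirely, so each
    block row costs at most alpha cycles instead of kb cycles.
    """
    return [[(j, s) for j, s in enumerate(row) if s >= 0] for row in Hb_s]
```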