National Chiao Tung University
Department of Electronics Engineering and Institute of Electronics, Master's Program
Master's Thesis
Design and Implementation of a (9153,8256) LDPC Decoder
with 2-bit Soft Input for NAND Flash Memory
Student: Kin-Chu Ho
Advisor: Dr. Hsie-Chia Chang
A Thesis
Submitted to Department of Electronics Engineering & Institute of Electronics
College of Electrical and Computer Engineering
National Chiao Tung University
In Partial Fulfillment of the Requirements
for the Degree of Master of Science
in
Electronics Engineering
August 2010
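Abstract (Chinese version)
BCH codes are currently the mainstream error correcting codes in flash memory systems because their hardware architecture is very simple. As advanced process technology and the sharp increase in storage capacity reduce reliability, BCH codes, which rely on algebraic decoding algorithms, can only improve decoding performance by continually adding parity bits, which in turn reduces the space available for data storage. This thesis therefore proposes a low-density parity-check (LDPC) code and its decoder architecture for flash memory systems. With 2-bit soft input, the LDPC code provides better error correcting capability than a BCH code at the same code rate. Because the page size of next-generation flash memory is 1024 bytes, we use the permutation matrix algorithm to construct a (9153, 8256) LDPC code with code rate 0.9, and use variable-node-centric sequential scheduling (VSS) to reduce the check node unit ...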
Abstract
This thesis proposes an LDPC decoder architecture for NAND flash memory systems. BCH codes are widely used in NAND flash memory systems because of their simple hardware architecture. However, technology scaling and storing more bits per NAND flash cell degrade reliability. More parity bits are then required to improve the correcting capability of BCH codes, but this greatly reduces storage capacity and is infeasible for commercial products. Soft input is required to improve the correcting capability of an error correcting code, yet BCH codes gain only little when soft input is provided. This thesis proposes a 2-bit soft-input LDPC decoder that can outperform BCH codes at the same code rate.
The (9153, 8256) LDPC code, with code rate 0.9, is constructed by the permutation matrix algorithm. The variable-node-centric sequential scheduling (VSS) architecture is adopted, and the CNU is modified to reduce hardware complexity. Compared with the conventional Min-Sum two-stage pipelined architecture, the proposed architecture reduces the combinational circuits of the VNU by approximately 96% and the registers by 76.8%. In 90 nm CMOS technology, the maximum throughput reaches 2.78 Gbps at an operating frequency of 100 MHz with 10 iterations.
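Acknowledgements
Before I knew it, my two years as a master's student are coming to an end, and I would like to thank the many people who looked after and helped me. First and foremost, I thank my parents; I am deeply grateful for their support. During my six years of undergraduate and master's study, I could not spend much time with them and could only go home briefly during the winter and summer breaks. Yet they never complained and supported me in doing what I wanted to do. I also thank my advisor, Professor Hsie-Chia Chang, who, beyond guiding my research, cared about my life; I am very grateful for his tolerance. I also thank my seniors 陳志龍 and 嚴紹維 of the LDPC GROUP, who, besides patiently guiding my research, often took me to try the food of Hsinchu and took great care of me. Finally, I thank every member of OCEAN and OASIS. Working hard together on research, and chatting and eating together, gradually built our friendship. All good things must come to an end, and quite a few members are leaving the OCEAN family this year. Although I am reluctant to part, I sincerely wish everyone a bright future. There are many more people I would like to thank, but space is limited. Let me thank every one of you once more for your care and help. Thank you!
Table of Contents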
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Thesis organization
Chapter 2 NAND Flash Memory
2.1 Introduction of NAND Flash Memory
2.1.1 Flash Memory System
2.1.2 NAND Flash Cell Programming
2.1.3 NAND Flash Cell Erasing
2.1.4 NAND Flash Cell Reading
2.2 Reliability of NAND Flash Memory
2.2.1 Electron Leakage
2.2.2 Program Disturb
2.2.3 Read Disturb
Chapter 3 Low Density Parity Check Code
3.1 Decoding Algorithm
3.1.1 Standard Belief Propagation (BP) Algorithm
3.1.2 Variable-node-centric Sequential Scheduling (VSS) Algorithm
3.2 Performance-Related Parameters
3.2.1 Cycles in Tanner Graph
3.2.2 Column Degree
3.3 Code Construction
3.3.1 Permutation Matrix Algorithm
3.3.2 Code Performance
Chapter 4 LDPC Decoder Architecture
4.1 Single Pipelined Architecture for VSS Algorithm
4.2 Check Node Unit (CNU)
4.2.1 Accumulative Sorter
4.2.2 Accumulative Sorter without 2nd minimum value
4.3 Variable Node Unit (VNU)
4.4 Shifting Network
Chapter 5 Simulation and Implementation Result
5.1 Quantization
5.2 Performance
5.3 Throughput
5.4 Implementation Results
Chapter 6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
List of Figures
2.1 The Block Diagram of Flash Memory System.
2.2 NAND Flash Cell Programming [1].
2.3 Threshold voltage distribution of a Single Level Cell of NAND Flash Memory [1].
2.4 NAND Flash Cell Erasing [1].
2.5 NAND Flash Cell Reading [1].
2.6 Program Disturb.
2.7 Read Disturb.
2.8 Threshold voltage distribution of a 2bits/cell NAND flash cell.
2.9 Threshold voltage distribution of a 2bits/cell NAND flash cell.
3.1 Illustration of standard BP.
3.2 Illustration of VSS.
3.3 An example of a Tanner graph with cycle-6.
3.4 Performance of LDPC code with different column degree.
3.5 Performance of LDPC code with different column degree.
3.6 Performance of LDPC code with different column degree.
3.7 An example of QC LDPC code, dc = 3, dv = 2 and p = 4.
3.8 Demonstration of cycle-4.
3.9 Parity check matrix H.
3.10 Performance of (9153, 8256) LDPC code.
4.1 Architecture and scheduling for VSS algorithm.
4.2 Conventional accumulative sorter.
4.3 Demonstration of conventional accumulative sorter.
4.4 Accumulative sorter w/o 2nd min.
4.5 Demonstration of accumulative sorter w/o 2nd min.
4.6 Performance of (9153, 8256) LDPC code with different global 2nd min compensation, MS - MinSum, MS-VSS - MinSum with variable-node-centric sequential scheduling.
4.7 Variable node unit architecture.
4.8 Illustration of messages shifted between CNUs.
4.9 Parity Check Matrix of (9153,8256) LDPC code.
5.1 2 bits (4 levels) non-linear quantization.
5.2 Performance of (9153, 8256) (Column deg = 8) LDPC code with different parameters.
5.4 Performance comparison, Iteration = 40.
5.5 Layout of Place and Route.
List of Tables
5.1 Synthesis result of CNU and VNU with technology UMC90.
5.2 Summary of implementation result (Place and Route).
Chapter 1
Introduction
1.1 Motivation
Error correcting codes are important to NAND flash memory systems since errors are unavoidable [1]. BCH codes [2] [3] are popular in NAND flash memory systems because of their simple hardware architecture and hard-input requirement. As technology scales down and more bits are stored per NAND flash cell, more errors are introduced. With a limited number of parity bits, the correcting capability of BCH codes is not enough to meet the requirements of next-generation NAND flash memory systems. Soft input is required to improve the correcting capability of an error correcting code. However, BCH codes improve only slightly when soft input is provided [4] [5]. The LDPC code [6] is a good candidate because of its powerful correcting capability and simple decoding algorithm. A 2-bit soft-input LDPC code can outperform a BCH code with the same code rate.
The low density parity check (LDPC) code is a well-known error correcting code with near Shannon limit performance [7]. Its parity check matrix H can be described by a Tanner graph [8], in which the rows and columns of H are mapped to check nodes and variable nodes respectively. In the standard belief propagation (BP) algorithm, an LDPC decoder iteratively exchanges messages between check nodes and variable nodes in a fully parallel manner.
High code rate is a necessary condition for error correcting code applied on NAND flash memory system. A high code rate LDPC code introduces large row degree which causes implementation difficulty. The proposed LDPC code has a row degree of 81. The
This greatly reduces the routing complexity and storage memory. A (9153, 8256) LDPC code with code rate 0.9 is constructed by the permutation matrix algorithm. The proposed LDPC decoder performs better than a BCH code with the same code rate when 2-bit soft input is provided. In 90 nm CMOS technology, the maximum throughput reaches 2.78 Gbps at an operating frequency of 100 MHz with 10 iterations.
1.2 Thesis organization
The rest of this thesis is organized as follows. Chapter 2 introduces NAND flash memory. Chapter 3 introduces the decoding algorithm, the performance-related code parameters and the code construction. Chapter 4 presents the decoder architecture. The simulation results are given in Chapter 5 and the conclusion in Chapter 6.
Chapter 2
NAND Flash Memory
2.1 Introduction of NAND Flash Memory
This section introduces the flash memory system and its basic operations: programming, erasing and reading.
2.1.1 Flash Memory System
Flash memory is widely used for data storage in portable devices. Since flash memory is non-volatile, no power is needed to maintain the stored information. In addition, flash memory offers faster read access than a hard disk. In this thesis, NAND flash memory is taken as the target flash memory.
There are three basic operations in NAND flash memory: programming, erasing and reading. NAND flash memory can be programmed and erased block by block, and each block contains a number of pages. NAND flash memory can be read page by page. More details of these three operations are presented in the following subsections.
Fig. 2.1 shows the flash memory system. Data are transmitted in pages where a page size is equal to 4K or 8K bytes. One single page consists of data area and spare area. The data area stores the user data, and the spare area stores the system-control signal and parity bits of error correcting code (ECC). Pages are encoded before programming, and decoded after reading from flash memory.
[Figure: block diagram with blocks labeled Flash Memory, Buffer, ECC and System]
Figure 2.1: The Block Diagram of Flash Memory System.
2.1.2 NAND Flash Cell Programming
Fig. 2.2 shows the programming of a NAND flash Cell. In a NAND flash Cell, there is a Floating Gate between the Gate and the Substrate. When data is written into the cell, 0V is applied to the Source and Drain, and a high voltage (VG) is applied to the Gate. Electrons in the Substrate are attracted to the Floating Gate. Different values of VG can be applied to control the amount of electrons injected into the Floating Gate, which determines the threshold voltage of the NAND flash Cell.
[Figure: cell cross-section with 0V on Source and Drain, VG on the Gate, and the Substrate below]
Figure 2.2: NAND Flash Cell Programming [1].
A Single Level Cell (SLC) stores only 1 bit of data per cell. Therefore, the threshold voltage region of an SLC is divided into two levels. Fig. 2.3 shows the threshold voltage distribution of an SLC. For example, the threshold voltage is controlled to 2.5V if data 1 is stored, or 5.5V if data 0 is stored. The threshold voltage varies because of noise disturbance, which is introduced in Section 2.2.
2.1.3 NAND Flash Cell Erasing
Electrons in the Floating Gate must be removed before reprogramming. When a NAND flash Cell is erased, 0V is applied to the Source, Drain and Gate, and a high voltage (VS) is applied to the Substrate. Electrons in the Floating Gate are attracted to the Substrate, and no electrons are left in the Floating Gate.
Figure 2.3: Threshold voltage distribution of a Single Level Cell of NAND Flash Memory [1].
[Figure: cell cross-section with 0V on Source, Drain and Gate, and VS on the Substrate]
Figure 2.4: NAND Flash Cell Erasing [1].
2.1.4 NAND Flash Cell Reading
Fig. 2.5 shows the reading of a NAND flash Cell. To read a NAND flash cell, the selected wordlines are grounded and a high voltage (VD) is applied to the unselected wordlines. A bias is applied to the bitlines. Current flows through the transistor if there is no charge stored in the cell.
[Figure: cell string with the selected WL at 0V, the unselected WLs at VD, and the Bit Line at VBIAS]
Figure 2.5: NAND Flash Cell Reading [1].
2.2 Reliability of NAND Flash Memory
Electron leakage, program disturb and read disturb cause variation in the threshold voltage of a NAND flash cell. Errors may be introduced if the threshold voltage shifts to another level. More details about these noise disturbances are given in this section.
2.2.1 Electron Leakage
The number of electrons stored in the Floating Gate decreases over time because electrons may leak from the NAND flash Cell. This problem can be solved by erasing and reprogramming periodically. But a NAND flash Cell may be damaged as the number of Program / Erase cycles increases, and leakage becomes more serious in a damaged cell. Errors therefore become unavoidable if a NAND flash Cell is to be used for a long time.
2.2.2 Program Disturb
Fig. 2.6 shows the program disturb of a NAND flash Cell. Unselected cells on the same wordline as the programmed cell, or on adjacent wordlines, may suffer from voltage stress that results in unwanted programming. Therefore, the threshold voltage of those unselected cells increases and may shift to another level.
[Figure: cell array with the selected WL at 20V, the unselected WLs at 10V, and bit lines at 0V or VCC; the programmed cell and the program-disturbed cells are marked]
Figure 2.6: Program Disturb.
2.2.3 Read Disturb
Unselected cells adjacent to cells being read may suffer from voltage stress resulting in unwanted programming. As in program disturb case, the threshold voltage of those unselected cells increases and may shift to other level.
[Figure: cell array with the selected page at 0V, the unselected pages at 4.5V, and bit lines at VBIAS; the read-disturbed cells are marked]
Figure 2.7: Read Disturb.
In Fig. 2.3, a threshold voltage below 4V represents that data 1 is stored, and a threshold voltage above 4V represents that data 0 is stored. There is a tolerance range for the variation of the threshold voltage: the data is still correct as long as the threshold voltage does not shift to another level.
Figure 2.8: Threshold voltage distribution of a 2bits/cell NAND flash cell.
Fig. 2.8 shows a 2bits/cell NAND flash cell. The storage capacity is doubled compared with the 1bit/cell NAND flash cell. The threshold voltage region is divided into 4 levels, so the region for each level is narrower. Therefore, the probability of the threshold voltage shifting to another level increases, which degrades reliability.
Nowadays, a NAND flash memory system only provides hard input to the error correcting code. For example, in Fig. 2.8, only three voltages (3.2V, 4V and 5.1V) are applied to check which level the threshold voltage is in. The NAND flash memory system does not provide any information about how likely a bit is to be '0' or '1'. The information received by the error correcting code is exactly '0' or '1'. We call this hard input.
BCH code is feasible because of its simple hardware architecture and because it requires only hard input. However, technology scaling and storing more bits per NAND flash cell degrade reliability. More parity bits are required to improve the correcting capability of BCH code. The resulting increase of the spare area (the area for parity bit storage) greatly reduces the data storage capacity and is infeasible for commercial products. To overcome this problem, NAND flash memory systems will provide more information (soft input) in the next generation standard, so that a much more powerful error correcting code can be adopted.
Figure 2.9: Threshold voltage distribution of a 2bits/cell NAND flash cell.
In Fig. 2.9, if data 01 is stored and the threshold voltage shifts to 5.5V, hard input only indicates that the second bit is a '0'. More information can be provided if one more voltage (5.8V) is applied to the Gate: we then know that the threshold voltage is less than 5.8V, and that the second bit has a high probability of being '1'. This provides more information about each data bit to the error correcting code, and we call this soft input.
BCH code improves only slightly when soft input is provided [4] [5]. LDPC code is probability-based and can make good use of soft information. Therefore, LDPC code is a good candidate for the next generation NAND flash memory system. Providing soft input increases the reading latency of the flash memory system, so there is a trade-off between correcting capability and system latency. This thesis shows that an LDPC code with only 2-bit soft input can outperform BCH code at the same code rate. Therefore, the degradation in system latency is minimized.
Chapter 3
Low Density Parity Check Code
LDPC code was first discovered by Gallager [6] in the early 1960s, but it did not attract great attention until the 1990s. The main reason is the high routing complexity, which makes implementation very difficult. The decoding algorithm of LDPC code is iterative message-passing decoding: messages are passed between the Check Node Unit (CNU) and the Variable Node Unit (VNU) during the decoding process. This iterative message-passing algorithm provides superior correcting ability and has made LDPC code widely adopted in communication applications.
In this chapter, the decoding algorithm is introduced and the performance-related code parameters are discussed. Finally, a code construction algorithm is introduced.
3.1 Decoding Algorithm
3.1.1 Standard Belief Propagation (BP) Algorithm
The log-likelihood ratio (LLR) of the intrinsic information of the nth variable node is denoted by $P_n$. The message from the nth variable node to the mth check node is denoted by $z_{mn}$, and the message from the mth check node to the nth variable node by $\epsilon_{mn}$. The a posteriori LLR of the nth bit is denoted by $z_n$. The current iteration number and the maximum number of iterations are represented by $i$ and $I_{Max}$ respectively. Standard BP is carried out as follows.
1. Initialization:
Set $i = 1$. For each $m, n$, set $z^0_{mn} = P_n$.
2. Iterative Decoding:
(a) check node to variable node update step, for $1 \le m \le M$ and each $n \in N(m)$, process
\[ \epsilon^i_{mn} = 2\tanh^{-1}\left( \prod_{n' \in N(m)\setminus n} \tanh\left( \frac{z^{i-1}_{mn'}}{2} \right) \right) \tag{3.1} \]
(b) variable node to check node update step, for $1 \le n \le N$ and each $m \in M(n)$, process
\[ z^i_{mn} = P_n + \sum_{m' \in M(n)\setminus m} \epsilon^i_{m'n} \tag{3.2} \]
\[ z^i_n = P_n + \sum_{m' \in M(n)} \epsilon^i_{m'n} \tag{3.3} \]
3. Hard Decision:
Let $X_n$ be the nth bit of the decoded codeword. If $z^{(i)}_n \ge 0$, $X_n = 0$; else if $z^{(i)}_n < 0$, $X_n = 1$. If $H(x^{(i)})^t = 0$ or $I_{Max}$ is reached, the decoder stops and outputs the codeword. Otherwise, it sets $i = i + 1$ and continues iterative decoding.
One iteration of the standard BP decoding process is illustrated in Fig. 3.1. The messages between check nodes and variable nodes are updated in parallel.
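As a minimal sketch of the three steps above (initialization, iterative decoding, hard decision), the following pure-Python routine applies equations (3.1) to (3.3) with fully parallel updates. The function name `bp_decode`, the small 4 x 6 example matrix and the input LLR values are illustrative assumptions and are not taken from the thesis; the clipping before `atanh` is a numerical safeguard added here.

```python
import math

def bp_decode(H, P, max_iter=10):
    """Standard BP on parity check matrix H (list of rows) with intrinsic LLRs P."""
    M, N = len(H), len(H[0])
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]  # N(m)
    cols = [[m for m in range(M) if H[m][n]] for n in range(N)]  # M(n)
    # Initialization: z^0_mn = P_n
    z = {(m, n): P[n] for m in range(M) for n in rows[m]}
    x = [0 if p >= 0 else 1 for p in P]
    for _ in range(max_iter):
        # (a) check node to variable node update, eq. (3.1)
        eps = {}
        for m in range(M):
            for n in rows[m]:
                prod = 1.0
                for n2 in rows[m]:
                    if n2 != n:
                        prod *= math.tanh(z[(m, n2)] / 2.0)
                # Safeguard (not in the equations): keep atanh finite
                prod = max(min(prod, 1.0 - 1e-12), -(1.0 - 1e-12))
                eps[(m, n)] = 2.0 * math.atanh(prod)
        # (b) variable node to check node update, eqs. (3.2) and (3.3)
        post = []
        for n in range(N):
            total = P[n] + sum(eps[(m, n)] for m in cols[n])
            post.append(total)
            for m in cols[n]:
                z[(m, n)] = total - eps[(m, n)]  # exclude the message from m itself
        # Hard decision and syndrome check
        x = [0 if v >= 0 else 1 for v in post]
        if all(sum(x[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            break
    return x

# Hypothetical example: all-zero codeword with one weakly wrong LLR on the first bit
H = [[1, 1, 0, 1, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 1]]
print(bp_decode(H, [-0.5, 2.0, 2.0, 2.0, 2.0, 2.0]))
```

A positive LLR is decided as bit 0, matching the hard decision rule in step 3.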
3.1.2 Variable-node-centric Sequential Scheduling (VSS) Algorithm
A high code rate LDPC code has a high row degree. This makes implementation difficult because of the large number of inputs to the sorter, which greatly increases the hardware cost and critical path of the Check Node Unit (CNU). Shuffle decoding algorithm [9] [11] with
[Figure: (a) Check node to variable node update of BP algorithm; (b) Variable node to check node update of BP algorithm]
Figure 3.1: Illustration of standard BP.
BP algorithm. The only difference between the two algorithms is the updating procedure. Assume the N bits of a codeword are divided into G groups, so that each group contains $N_G = N/G$ bits. Messages are exchanged only between the variable nodes of one group and the check nodes connected to that group, and the groups are updated in order, so one iteration takes G cycles. For G = 1, VSS scheduling becomes standard BP.
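The normalized min-sum (NMS) algorithm, which compensates the approximation error in the check node update step, can also be applied to the VSS approach with normalization factor $\beta = 0.5$. The updating procedure of the NMS algorithm with the VSS approach is carried out as follows.
1. Initialization: For each $m, n$, set $z^0_{mn} = P_n$.
2. Iterative Decoding:
(a) check node to variable node update step, for $1 \le m \le M$ and each $n \in N(m)$, process
\[ \epsilon^i_{mn} = \prod_{\substack{n' \in N(m)\setminus n \\ n' \le g\cdot N_G - 1}} \mathrm{sign}(z^i_{mn'}) \times \prod_{\substack{n' \in N(m)\setminus n \\ n' \ge g\cdot N_G}} \mathrm{sign}(z^{i-1}_{mn'}) \times \min\left\{ \min_{\substack{n' \in N(m)\setminus n \\ n' \le g\cdot N_G - 1}} |z^i_{mn'}|,\ \min_{\substack{n' \in N(m)\setminus n \\ n' \ge g\cdot N_G}} |z^{i-1}_{mn'}| \right\} \times \beta \tag{3.4} \]
(b) variable node to check node update step, for $g\cdot N_G \le n \le (g+1)\cdot N_G - 1$ and each $m \in M(n)$, process
\[ z^i_{mn} = P_n + \sum_{m' \in M(n)\setminus m} \epsilon^i_{m'n} \tag{3.5} \]
\[ z^i_n = P_n + \sum_{m' \in M(n)} \epsilon^i_{m'n} \tag{3.6} \]
3. Hard Decision:
Let $X_n$ be the nth bit of the decoded codeword. If $z^{(i)}_n \ge 0$, $X_n = 0$; else if $z^{(i)}_n < 0$, $X_n = 1$. If $H(x^{(i)})^t = 0$ or $I_{Max}$ is reached, the decoder stops and outputs the codeword. Otherwise, it sets $i = i + 1$ and continues iterative decoding.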
The decoding process for one iteration of VSS is illustrated in Fig. 3.2 with G = 3 as example. The arrows with blue color represent check node to variable node messages to be updated. The arrows with red color represent variable node to check node messages to be updated. On the other hand, black lines represent that messages are not updated in that cycle.
[Figure: Tanner graphs with check nodes C1 to C3 and variable nodes V1 to V6; (a) 1st group's message updated; (b) 2nd group's message updated; (c) 3rd group's message updated]
Figure 3.2: Illustration of VSS.
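The group-by-group schedule can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the decoder of this thesis: the name `nms_vss_decode`, the 4 x 6 example matrix and the LLR values are hypothetical. Because the messages $z_{mn}$ are updated in place group by group, a check node automatically sees current-iteration values from groups already processed and previous-iteration values from the remaining groups, as in equation (3.4).

```python
def nms_vss_decode(H, P, G, beta=0.5, max_iter=10):
    """Normalized min-sum with variable-node-centric sequential scheduling."""
    M, N = len(H), len(H[0])
    assert N % G == 0, "N must be divisible by the number of groups"
    NG = N // G
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]  # N(m)
    cols = [[m for m in range(M) if H[m][n]] for n in range(N)]  # M(n)
    z = {(m, n): P[n] for m in range(M) for n in rows[m]}        # z^0_mn = P_n
    eps = {}
    post = list(P)
    x = [0 if p >= 0 else 1 for p in P]
    for _ in range(max_iter):
        for g in range(G):
            group = range(g * NG, (g + 1) * NG)
            # (a) check-to-variable messages for the variable nodes of group g
            for n in group:
                for m in cols[n]:
                    others = [z[(m, n2)] for n2 in rows[m] if n2 != n]
                    negative = sum(v < 0 for v in others) % 2
                    sign = -1.0 if negative else 1.0
                    eps[(m, n)] = sign * min(abs(v) for v in others) * beta
            # (b) variable-to-check messages of group g, eqs. (3.5) and (3.6)
            for n in group:
                post[n] = P[n] + sum(eps[(m, n)] for m in cols[n])
                for m in cols[n]:
                    z[(m, n)] = post[n] - eps[(m, n)]
        # Hard decision and syndrome check once per iteration
        x = [0 if v >= 0 else 1 for v in post]
        if all(sum(x[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            break
    return x

# Hypothetical example: 6 variable nodes split into G = 3 groups of 2
H = [[1, 1, 0, 1, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 1]]
print(nms_vss_decode(H, [-0.5, 2.0, 2.0, 2.0, 2.0, 2.0], G=3))
```

With `G=1` every variable node belongs to the same group, so the loop degenerates to the fully parallel schedule.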
3.2 Performance-Related Parameters
3.2.1 Cycles in Tanner Graph
An LDPC code with cycle-4 introduces smaller trapping sets [12], which cause performance degradation in the waterfall region. For LDPC codes, this performance degradation is called the error floor [13]. Therefore, cycle-4 should be avoided when constructing an LDPC code, and the cycles should be as long as possible. Fig. 3.3 illustrates a Tanner graph with cycle-6 and its corresponding parity check matrix.
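[Figure: (a) A Tanner graph with cycle-6, with check nodes C1 to C4 and variable nodes V1 to V6; (b) Parity check matrix H corresponding to (a)]
\[ H = \begin{bmatrix} 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix} \]
Figure 3.3: An example of a Tanner graph with cycle-6.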
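The length of the shortest cycle can be checked mechanically. The sketch below computes the girth of a Tanner graph by breadth-first search from every node; the function name `tanner_girth` is hypothetical, and the 4 x 6 matrix assumes that the entries of Fig. 3.3(b) are read row by row.

```python
import math
from collections import deque

def tanner_girth(H):
    """Length of the shortest cycle in the Tanner graph of H (math.inf if acyclic)."""
    M, N = len(H), len(H[0])
    # Nodes 0..M-1 are check nodes, nodes M..M+N-1 are variable nodes
    adj = [[] for _ in range(M + N)]
    for m in range(M):
        for n in range(N):
            if H[m][n]:
                adj[m].append(M + n)
                adj[M + n].append(m)
    best = math.inf
    for start in range(M + N):
        dist = {start: 0}
        parent = {start: None}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    parent[v] = u
                    queue.append(v)
                elif parent[u] != v:
                    # A non-tree edge closes a cycle through the BFS tree
                    best = min(best, dist[u] + dist[v] + 1)
    return best

H = [[1, 1, 0, 1, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 1]]
print(tanner_girth(H))
```

For this matrix no two rows share more than one column, so there is no cycle-4, while V1, C1, V2, C3, V3, C2 form a cycle of length 6.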
3.2.2 Column Degree
An LDPC code with a higher column degree performs better in the waterfall region; that is, it pushes the error floor down to a lower bit error rate region. Fig. 3.4 shows the performance of LDPC codes with different column degrees. S represents the scaling factor in this thesis.
In Fig. 3.4, (672, 588) is an LDPC code from the IEEE 802.15.3c Standard, with column degree 3. It has poor performance in the waterfall region because of its low column degree. The LDPC codes with column degree 8 and 12 perform better in the waterfall region.
[Plot: BER versus Eb/No (dB), from 3.5 to 5 dB, BER from 10^-1 down to 10^-6; N = 10^7, AWGN Channel, Iteration = 25, Normalized Min-Sum. Curves: (9409,8256), Column Deg = 12, S = 0.4, R = 0.877; (9153,8256), Column Deg = 8, S = 0.5, R = 0.9; (672,588), Column Deg = 3, S = 0.4, R = 0.875]
Figure 3.4: Performance of LDPC code with different column degree.
Fig. 3.5 shows that an LDPC code with a higher column degree performs better in the waterfall region. Both the (2071,1746) and (2033,1714) LDPC codes are constructed by the permutation matrix algorithm [14], which is introduced in Section 3.3. LDPC codes constructed by the permutation matrix algorithm have no cycle-4. They are QC codes [15] and their column degree is 4. For the (2048,1723) LDPC code (IEEE 802.3an Standard [16]), the error floor does not appear until the BER is down to $10^{-10}$. Thus, high column degree LDPC code is
[Plot: BER versus Eb/No (dB), from 2.5 to 5 dB, BER from 10^-1 down to 10^-6; N = 10^7, Iteration = 50, Normalized Min-Sum. Curves: (2071, 1746), Column Deg = 3, S = 0.75; (2033, 1714), Column Deg = 3, S = 0.75; (2048, 1723), Column Deg = 6, S = 0.75]
Figure 3.5: Performance of LDPC code with different column degree.
In Fig. 3.6, the performance improvement in the waterfall region from a higher column degree is not clear. Since the codeword length is very long, the improvement is expected to appear at a deeper bit error rate. Software simulation is not fast enough to investigate the error floor; FPGA simulation will be done in the future. An error correcting code applied to a NAND flash memory system requires a high code rate and no performance degradation down to a bit error rate near $10^{-12}$. Therefore, an LDPC code with a higher column degree and no cycle-4 is preferred. The proposed LDPC code in this thesis is (9153, 8256), with column degree 8 and no cycle-4.
[Plot: BER versus Eb/No (dB), from 3 to 5 dB, BER from 10^-1 down to 10^-7; 10^5 codewords, Iteration = 50, Normalized Min-Sum. Curves: (9160,8247), R = 0.9, Column Deg = 4, S = 0.75; (9050,8149), R = 0.9, Column Deg = 5, S = 0.75; (9153,8256), R = 0.9, Column Deg = 8, S = 0.5]
Figure 3.6: Performance of LDPC code with different column degree.
3.3 Code Construction
3.3.1 Permutation Matrix Algorithm
The permutation matrix algorithm [14] is a construction of QC LDPC codes. The parity check matrix H of a QC code is composed of many sub-matrices, each of which is an identity matrix or a cyclic shift of an identity matrix. An example of a QC code is shown in Fig. 3.7. The number inside a sub-matrix represents the amount of cyclic shift.
[Figure: 8 x 12 binary parity check matrix H built from 4 x 4 cyclically shifted identity sub-matrices, with shift amounts 0 1 2 in the first sub-matrix row and 0 3 1 in the second]
Figure 3.7: An example of QC LDPC code, dc = 3, dv = 2 and p = 4.
Cycle-4 causes performance degradation, and this code construction avoids any cycle-4. The algorithm is described in [14]; this thesis provides another view of it. There are 3 parameters to be decided: row degree (dc), variable node degree (dv) and size of sub-matrix (p). The row degree determines the number of sub-matrices in one sub-matrix row, and the variable node degree determines the number of sub-matrices in one sub-matrix column.
Another view of the permutation matrix algorithm:
Let $S_{i,j}$ denote the amount of cyclic shift of the sub-matrix in the ith sub-matrix row and jth sub-matrix column. $d_c$ represents the row degree, $d_v$ represents the variable node degree, and $p$ represents the size of a sub-matrix, which must be a prime number.
1. Initialization:
\[ S_{0,j} = j, \quad 0 \le j \le d_c - 1 \]
2. Completion of the remaining $S_{i,j}$:
\[ S_{i,j} = (j + (j + 1)\cdot i) \bmod p, \quad 0 \le j \le d_c - 1,\ 0 \le i \le d_v - 1 \]
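The two steps above can be sketched in a few lines. The function names `shift_matrix` and `expand` are hypothetical, and the direction of the cyclic shift used in `expand` (row r of a sub-matrix with shift s has its 1 in column (r - s) mod p) is an assumption read off the example of Fig. 3.7. The parameters dc = 81, dv = 8 and p = 113 are those of the proposed (9153, 8256) code.

```python
def shift_matrix(dc, dv, p):
    """Shift amounts S[i][j] = (j + (j + 1) * i) mod p; row i = 0 gives S[0][j] = j."""
    return [[(j + (j + 1) * i) % p for j in range(dc)] for i in range(dv)]

def expand(S, p):
    """Expand a shift matrix into a binary H made of p x p shifted identity matrices.

    Assumed convention: row r of a block with shift s has its 1 in column (r - s) mod p.
    """
    dv, dc = len(S), len(S[0])
    H = [[0] * (dc * p) for _ in range(dv * p)]
    for i in range(dv):
        for j in range(dc):
            for r in range(p):
                H[i * p + r][j * p + (r - S[i][j]) % p] = 1
    return H

# Shift amounts of the proposed code (only the small matrix S is built here)
S = shift_matrix(dc=81, dv=8, p=113)
rows, cols = len(S) * 113, len(S[0]) * 113
print(rows, cols)  # number of check nodes and codeword length

# The small QC example with shift amounts 0 1 2 / 0 3 1 and p = 4
H_small = expand([[0, 1, 2], [0, 3, 1]], 4)
print(H_small[0])
```

Only the 8 x 81 table of shift amounts needs to be stored to describe the full parity check matrix, which is what makes the quasi-cyclic structure attractive for hardware.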
Fig. 3.8 demonstrates the condition under which cycle-4 occurs. For any 4 numbers forming a square (the red dashed box), if the difference between the cyclic shift amounts in one sub-matrix column is equal to the difference between the cyclic shift amounts in the other sub-matrix column, a cycle-4 is formed. For example, in Fig. 3.8, the difference between 1 and 2 is equal to the difference between 2 and 3.
[Figure: 8 x 12 binary parity check matrix H built from 4 x 4 cyclically shifted identity sub-matrices, with shift amounts 0 1 2 in the first sub-matrix row and 0 2 3 in the second; the red dashed box marks the four shift amounts that form a cycle-4]
Figure 3.8: Demonstration of cycle-4.
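The cycle-4 condition described above translates directly into a check on the table of shift amounts. In this sketch, the function name `has_cycle4` is hypothetical; the two small tables are the shift amounts shown in Fig. 3.7 and Fig. 3.8, and the last table uses the construction formula with dc = 81, dv = 8 and p = 113.

```python
from itertools import combinations

def has_cycle4(S, p):
    """True if two sub-matrix columns have the same shift difference (mod p)
    between some pair of sub-matrix rows, which is the cycle-4 condition."""
    for r1, r2 in combinations(range(len(S)), 2):
        seen = set()
        for j in range(len(S[0])):
            diff = (S[r2][j] - S[r1][j]) % p
            if diff in seen:
                return True
            seen.add(diff)
    return False

fig_3_7 = [[0, 1, 2], [0, 3, 1]]   # differences 0, 2, 3: all distinct
fig_3_8 = [[0, 1, 2], [0, 2, 3]]   # differences 0, 1, 1: cycle-4
proposed = [[(j + (j + 1) * i) % 113 for j in range(81)] for i in range(8)]

print(has_cycle4(fig_3_7, 4), has_cycle4(fig_3_8, 4), has_cycle4(proposed, 113))
```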
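The number in each small square represents the cyclic shift amount of that sub-matrix. There are n sub-matrices in one sub-matrix row and m sub-matrices in one sub-matrix column, and p is the size of a sub-matrix.
Figure 3.9: Parity check matrix H.
\[ C = [A + (x_1 + 1)m'] \bmod p, \qquad D = [B + (x_2 + 1)m'] \bmod p \tag{3.7} \]
\[ C - A = [(x_1 + 1)m'] \bmod p, \qquad D - B = [(x_2 + 1)m'] \bmod p \tag{3.8} \]
where $m' = m_2 - m_1$; $0 \le x_1 < x_2 \le n - 1$; $0 \le m_1 < m_2 \le m - 1$; $n, m \le p$. Here A and B are the shift amounts in sub-matrix row $m_1$ at sub-matrix columns $x_1$ and $x_2$, and C and D are the shift amounts in sub-matrix row $m_2$ at the same columns.
Since p is a prime number, $x_1 < x_2 \le n - 1$ and $n \le p$, the difference $(D - B) - (C - A) = [(x_2 - x_1)m'] \bmod p$ is a product of two nonzero factors smaller than p and is therefore nonzero. Hence (C − A) is never equal to (D − B), and no cycle-4 is formed.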
3.3.2 Code Performance
The proposed LDPC code in this thesis is (9153, 8256) with code rate 0.9 and column degree 8. The size of a sub-matrix is 113 and the decoding algorithm is Normalized Min-Sum. S represents the scaling factor, and the number of iterations is 40. Fig. 3.10 shows its performance.
[Plot: BER versus Eb/No (dB), from 4 to 4.8 dB, BER from 10^-1 down to 10^-6; N = 10^7, R = 0.9, S = 0.5, Iteration = 40. Curve: (9153,8256), AWGN Channel, Floating]
Figure 3.10: Performance of (9153, 8256) LDPC code.
Chapter 4
LDPC Decoder Architecture
4.1 Single Pipelined Architecture for VSS Algorithm
The variable-node-centric sequential scheduling (VSS) algorithm [10] was introduced in the previous chapter. The hardware architecture is explained in this chapter.
The entire decoder depicted in Fig. 4.1(a) is composed of fully-parallel CNUs and partially-parallel VNUs. Variable nodes are divided into 27 groups (G = 27). There are 904 Check Node Units (CNU) and 339 Variable Node Units (VNU). Let $\alpha^i_{g,m}$ denote the sorted messages (1st min, 2nd min and indices) from the variable nodes in the gth group to the mth check node at the ith iteration:
\[ \alpha^i_{g,m} = \min_{\substack{n' \in N(m)\setminus n \\ g\cdot N_G \le n' \le (g+1)\cdot N_G - 1}} |z^i_{mn'}| \tag{4.1} \]
Then the magnitude part of the check node to variable node message in equation 3.4 can be computed by the following equation:
\[ \epsilon^i_{mn} = \min\left\{ \alpha^i_{j,m}\big|_{j<g},\ \alpha^i_{g,m},\ \alpha^{i-1}_{k,m}\big|_{k>g} \right\} \tag{4.2} \]
Fig. 4.1(b) shows the timing diagram of the proposed decoder. G initialization cycles are required to calculate $\alpha^0_{g,m}$ for $0 \le g \le G - 1$. Since only one subgroup of the messages $z^i_{mn}$ is updated in each cycle of one iteration, the main operation of the CNU can be simplified to calculating $\alpha^i_{g,m}$ (local sorting) in each cycle and then performing global sorting as in equation 4.2. In the single pipelined architecture, only the messages $\alpha^i_{g,m}$ need to be stored, while the variable node to check node messages $z^i_{mn}$ are calculated on the fly. The CNU can be updated immediately after the VNU's operations in the VSS approach, and no variable node to check node message needs to be stored.
[Figure: (a) Single pipelined architecture for VSS algorithm, with 1st min / 2nd min sorters connected through routing networks; (b) Timing schedule, with G initialization cycles followed by G cycles for each iteration]
Figure 4.1: Architecture and scheduling for VSS algorithm.
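The decoder dimensions quoted above follow from the code parameters, as the short calculation below shows. The throughput line is an assumption made here for illustration: it counts G initialization cycles plus G cycles per iteration and counts only the 8256 information bits of each codeword. This accounting is not stated in this chapter, but it is consistent with the 2.78 Gbps quoted in the abstract.

```python
N, K = 9153, 8256          # codeword length and information bits
dc, dv, p = 81, 8, 113     # row degree, column degree, sub-matrix size
G = 27                     # number of variable node groups
f_clk = 100e6              # operating frequency in Hz
iterations = 10

vnus = N // G                    # variable node units working in parallel
cnus = dv * p                    # one check node unit per row of H
inputs_per_cnu = dc // G         # messages each CNU receives per cycle

# Assumed cycle count: G initialization cycles + G cycles per iteration
cycles_per_codeword = G * (iterations + 1)
throughput = K * f_clk / cycles_per_codeword   # information bits per second

print(vnus, cnus, inputs_per_cnu)
print(round(throughput / 1e9, 2), "Gbps")
```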
4.2 Check Node Unit (CNU)
This section presents the detailed CNU architecture based on VSS scheduling. The CNU architecture is further optimized to reduce the storage requirement and the number of sorters. Different CNU architectures affect the convergence speed and performance, which will be discussed in the next chapter. The messages sent from the VNU are converted from two's complement format to sign-magnitude format for efficient computation in the CNU. Therefore, the check node to variable node update can be divided into a magnitude part and a sign part. For the proposed LDPC code with row degree 81 and the VSS approach with G = 27, the number of messages to be computed in each CNU per group is 3.
4.2.1 Accumulative Sorter
Fig. 4.2 illustrates the magnitude part of the CNU, which is an accumulative sorter composed of a local sorter and a global sorter. The local sorter finds the local 1st min and 2nd min values in each subgroup, and the global sorter finds the global 1st min and 2nd min values of a row. G − 1 registers are required to store the local 1st min values from different groups, and the same number is required for the local 2nd min values. The global sorter has 27 × 2 = 54 inputs in total. The number of registers increases as G becomes larger, which increases the number of inputs to the global sorter and the critical path.
[Figure: local sorter producing 1st min and 2nd min, two chains of G-1 registers holding the local 1st min and 2nd min of different groups, and a global sorter producing the global 1st min and global 2nd min]
Figure 4.2: Conventional accumulative sorter.
In the demonstration of Fig. 4.3, the row degree is 9 and the number of groups (G) is 3. R1st min and R2nd min represent the local 1st min and 2nd min of each group respectively. The values in the registers are reset to infinity before initialization. Since G = 3, there are three variable nodes in each group, and they provide new values to the sorter every cycle. The local 1st min and 2nd min are obtained and stored in the registers, and the values in each register are shifted to the right. The global sorter chooses the global 1st min and 2nd min from these 7 values (3 new inputs, plus the local 1st min and 2nd min from 2 local groups). The red number represents the global 1st min in that cycle.
            Group 1          Group 2          Group 3          Group 1
R1st min    ∞, ∞             0.1, ∞           0.4, 0.1         0.7, 0.4
R2nd min    ∞, ∞             0.2, ∞           0.5, 0.2         0.8, 0.5
Inputs      0.1, 0.2, 0.3    0.4, 0.5, 0.6    0.7, 0.8, 0.9    0.5, 0.6, 0.7
Figure 4.3: Demonstration of conventional accumulative sorter.
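The register contents of the demonstration can be reproduced with a short behavioural model. This is a sketch under stated assumptions rather than a description of the hardware: the function name `conventional_sorter` is hypothetical, the registers are modelled as a shift register of (1st min, 2nd min) pairs, and index tracking is omitted.

```python
import math
from collections import deque

def conventional_sorter(cycles, G):
    """Trace an accumulative sorter that stores the local 1st/2nd min of G-1 groups.

    `cycles` holds one list of input magnitudes per clock cycle (one group per cycle).
    For each cycle the trace records the register contents before the update and the
    global 1st/2nd min taken over the new inputs and all stored local minima.
    """
    regs = deque([(math.inf, math.inf)] * (G - 1), maxlen=G - 1)
    trace = []
    for inputs in cycles:
        stored = [value for pair in regs for value in pair]
        global_1st, global_2nd = sorted(list(inputs) + stored)[:2]
        trace.append({"regs": list(regs), "global": (global_1st, global_2nd)})
        local = tuple(sorted(inputs)[:2])
        regs.appendleft(local)  # newest group on the left, oldest drops off the right
    return trace

# Inputs of the demonstration: row degree 9, G = 3, then group 1 of the next iteration
cycles = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [0.5, 0.6, 0.7]]
for step in conventional_sorter(cycles, G=3):
    print(step)
```

In the fourth cycle the stale minimum 0.1 of group 1 has been shifted out of the registers, so the global 1st min becomes 0.4.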
4.2.2
Accumulative Sorter without 2nd Minimum Value
To reduce storage memory, the local 2nd min values and the global 2nd min value are not stored. The local 1st min value is the minimum value over G − 1 groups, and the global 2nd min value is taken directly from the local 1st min value. This may cause some performance loss.
When the local 1st min value is smaller than the global 1st min value, the global 1st min value is replaced by the local 1st min value. The value stored in the local 1st min register is then set to the maximum value, and the local sorter starts to find a new local 1st min value.
Conventionally, when the group currently being updated is the group that the global 1st min value came from, the global 2nd min value should be sent to the variable nodes. Since the global 2nd min value is not stored here, a substitute is needed.
There are some methods for compensating the global 2nd min, such as multiplying the original global 1st min by a scalar or adding a scalar to it, but these methods provide only limited improvement. Since the local 1st min value over G − 1 groups contains updated information, taking the local 1st min value as the global 2nd min value provides a better improvement.
[Figure 4.4 diagram: a local sorter with one 1st min register feeding a global sorter with one 1st min register.]
Figure 4.4: Accumulative sorter w/o 2nd min.
A demonstration is provided in Fig 4.4. We assume row degree = 9 and number of groups (G) = 3. Rlocal represents the local 1st min and Rglobal represents the global 1st min. The number of registers is independent of G. The number of inputs to the local sorter is N/G + 1, where N is the row degree, and the number of inputs to the global sorter is 2. The new global 1st min comes from the three new inputs, the local 1st min and the previous global 1st min. The red number represents the global 1st min in that cycle. After the initialization, the global 1st min stored in the register comes from group 1. At the 4th cycle (group 1 update of the 1st iteration), there are new values from group 1 and the global 1st min in the register should be cleared. Therefore, the global 1st min is replaced by the local 1st min.
             Group 1        Group 2        Group 3        Group 1
Rlocal       ∞              ∞              0.4            ∞
Rglobal      ∞              0.1            0.1            0.4
Inputs       0.1 0.2 0.3    0.4 0.5 0.6    0.7 0.8 0.9    0.5 0.6 0.7
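The register values in the demonstration above can be reproduced with a small behavioural sketch of the sorter without 2nd min. It assumes that a group index is kept alongside each of the two registers so that a stale global 1st min can be detected; the function name and data layout are hypothetical. A stale local value is not described in this section and is not modelled here.

```python
import math


def sorter_without_second_min(schedule):
    """schedule: list of (group_id, inputs).

    Returns one (R_local, R_global) snapshot per cycle, taken after the
    stale-value check and before the cycle's new inputs are sorted.
    """
    r_local, local_group = math.inf, None
    r_global, global_group = math.inf, None
    snapshots = []
    for group_id, inputs in schedule:
        # The group that produced the global 1st min is updated again:
        # replace the global 1st min by the local 1st min and reset the latter.
        if global_group == group_id:
            r_global, global_group = r_local, local_group
            r_local, local_group = math.inf, None
        snapshots.append((r_local, r_global))
        # Local sorter: N/G new inputs plus the stored local 1st min.
        new_min = min(inputs)
        if new_min < r_local:
            r_local, local_group = new_min, group_id
        # Global sorter: two inputs, local 1st min and global 1st min.
        if r_local < r_global:
            r_global, global_group = r_local, local_group
            r_local, local_group = math.inf, None
    return snapshots
```

Running it on the four cycles of the demonstration yields (∞, ∞), (∞, 0.1), (0.4, 0.1) and (∞, 0.4) for (Rlocal, Rglobal), matching the table.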
[Plot: BER (10^-7 to 10^-2) versus Iteration (0 to 20); N=1000627200, SNR = 5.1dB, (9153,8256), S=0.5, Q=(4,2). Curves: MS; MS-VSS with 2nd min; MS-VSS, Global 2nd min = local 1st min + 0.25; MS-VSS, Global 2nd min = local 1st min + 0.5; MS-VSS, Global 2nd min = local 1st min.]
Figure 4.6: Performance of (9153, 8256) LDPC code with different global 2nd min compensation. MS: MinSum; MS-VSS: MinSum with variable-node-centric sequential scheduling.
Figure 4.6 shows the performance of the (9153, 8256) LDPC code with different global 2nd min compensation. The MinSum with VSS algorithm converges faster than the MinSum algorithm. If the global 2nd min is not stored, there is some performance degradation and the convergence speed decreases, but the reduced-storage version is preferred for the FPGA simulation. Compensation on the global 2nd min (local 1st min) does not provide any improvement. Thus, no compensation on the global 2nd min is preferred. BER decreases slowly after the 10th iteration due to the absence of the original global 2nd min value.
4.3
Variable Node Unit (VNU)
Fig. 4.7 shows the architecture of a VNU. SM to TC represents sign-magnitude to two’s-complement conversion, and TC to SM represents two’s-complement to sign-magnitude conversion. The registers correspond to the channel values of the different groups. Since G = 27, there are 27 2-bit registers to store channel values in one VNU. The bit width of the messages passed between CNU and VNU is 4. The variable node degree is 8, so the adder has 9 inputs (8 check node messages plus the channel value). The 2-bit channel value is mapped to a 4-bit value by non-linear quantization. More details of the non-linear quantization will be discussed in the next chapter.
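The data path described above (SM to TC, adder, clipping, TC to SM, decoded bit from the sign) can be sketched as follows. This is only an illustrative model: it assumes a 4-bit message consists of 1 sign bit and 3 magnitude bits, so values are clipped to ±7, and it takes the channel value as an already mapped signed integer. The function name is hypothetical.

```python
def vnu_update(channel_value, check_msgs, width=4):
    """Variable node update for one variable node.

    channel_value: signed integer (already mapped to the message scale).
    check_msgs: list of (sign, magnitude) pairs from the check nodes.
    Returns (decoded_bit, outgoing (sign, magnitude) messages).
    """
    max_mag = (1 << (width - 1)) - 1  # 7 for a 4-bit sign-magnitude message

    # SM to TC: turn each (sign, magnitude) pair into a signed integer.
    signed = [-mag if sign else mag for sign, mag in check_msgs]

    # Adder: channel value plus all incoming check node messages.
    total = channel_value + sum(signed)

    # Decoded bit is the sign (MSB) of the sum; data '0' maps to a positive value.
    decoded_bit = 1 if total < 0 else 0

    outgoing = []
    for value in signed:
        extrinsic = total - value                            # exclude own message
        extrinsic = max(-max_mag, min(max_mag, extrinsic))   # clipping
        outgoing.append((1 if extrinsic < 0 else 0, abs(extrinsic)))  # TC to SM
    return decoded_bit, outgoing
```

For instance, a channel value of -2 with incoming messages +1, -3 and -2 sums to -6, so the decoded bit is 1 and the outgoing messages are -7, -3 and -4.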
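[Figure 4.7 diagram: 4-bit inputs pass through SM to TC converters into an adder together with the channel value registers (R); the results pass through clipping and TC to SM converters to form 4-bit output messages, and the decoded bit is taken from the MSB.]
4.4
Shifting Network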
The high complexity of the routing network between Check Node Units (CNU) and Variable Node Units (VNU) is the main difficulty in the hardware implementation of LDPC codes. Shifting networks [17] [18] [19] [20] have been proposed to reduce the routing complexity. There are two routing networks between CNU and VNU: one in the direction from CNUs to VNUs, and the other in the direction from VNUs to CNUs.
The shifting network of an LDPC code constructed by the permutation matrix algorithm can be simplified. The wire connection from CNUs to VNUs is fixed and no shifting network is needed. Instead, the messages of each CNU are shifted between CNUs. The idea is explained in Fig 4.8.
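\[
H = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0
\end{bmatrix}
\]
(a) Parity Check Matrix of an LDPC code, with variables divided into 6 groups

[Fig. 4.8, second panel: CNU labels placed around each pair of variable nodes, rotated by one position from group to group]
C1 C2 C3    V1  V2     C4 C5 C6
C2 C3 C4    V3  V4     C5 C6 C1
C3 C4 C5    V5  V6     C6 C1 C2
C4 C5 C6    V7  V8     C1 C2 C3
C5 C6 C1    V9  V10    C2 C3 C4
C6 C1 C2    V11 V12    C3 C4 C5

[Figure 4.9 diagram: cyclic shift amounts of sub-matrices]
Figure 4.9: Parity Check Matrix of (9153,8256) LDPC code.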
Fig. 4.9 shows the cyclic shift amounts of some sub-matrices in the parity check matrix of the (9153,8256) LDPC code. Since G = 27, 3 sub-matrices are processed in each decoding cycle. The difference between the cyclic shift amounts of successive groups is a constant. Thus, messages are shifted between CNUs after each decoding cycle and the routing network can be eliminated.
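The constant shift between groups can be checked on the small example of Fig. 4.8(a). The sketch below assumes the 72 matrix entries are read row by row as a 6 × 12 matrix with two variable nodes per group; the function name is hypothetical. It finds, for every group, the cyclic shift of check node indices that maps the first group's connections onto that group's connections.

```python
H = [
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0],
]


def group_shifts(h, group_size):
    """Return, per group, the cyclic shift relative to the first group (or None)."""
    num_checks = len(h)
    num_groups = len(h[0]) // group_size

    def checks(col):
        # Set of check node indices connected to variable node `col`.
        return {r for r in range(num_checks) if h[r][col]}

    shifts = []
    for g in range(num_groups):
        found = None
        for s in range(num_checks):
            if all(
                {(r + s) % num_checks for r in checks(j)} == checks(g * group_size + j)
                for j in range(group_size)
            ):
                found = s
                break
        shifts.append(found)
    return shifts
```

For this example `group_shifts(H, 2)` gives `[0, 1, 2, 3, 4, 5]`: each group is the previous one shifted by one check node, so rotating the CNU messages by one position per cycle lets the CNU-to-VNU wiring stay fixed.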
4.5
Comparison with Conventional Architectures
For the accumulative sorter in Fig. 4.2, a larger subgroup number G results in fewer inputs to the local sorter but more inputs to the global sorter, and the storage required for the 1st min, 2nd min, and index values increases. In addition, the critical path will be shorter when G is larger because the sorter is smaller. In the traditional two-stage pipelined architecture, both the check node to variable node messages and the variable node to check node messages are kept in registers or memory. Assume the bit-width w of the messages is 4 and the variable node degree is dv; then the required memory size (or number of registers) is as follows:
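Conventional Min-Sum two-stage pipelined architecture:
\begin{align}
\mathrm{Reg}_{VNU} + \mathrm{Reg}_{CNU}
&= N \cdot d_v \cdot w + m \cdot (\text{1st min} + \text{2nd min} + \text{Index} + \text{Sign}) \notag \\
&= 9153 \cdot 8 \cdot 4 \text{ bits} + 904 \cdot (3 + 3 + 7 + 81) \notag \\
&= 377872 \text{ bits} \tag{4.3}
\end{align}
VSS architecture with conventional accumulative sorter (Fig. 4.2):
\begin{align}
\mathrm{Reg}_{VNU} + \mathrm{Reg}_{CNU}
&= 0 + m \cdot (\text{local 1st min} + \text{local 2nd min} + \text{global 1st min} + \text{global 2nd min} + \text{Index} + \text{Sign}) \notag \\
&= 904 \cdot (3 \cdot 26 + 3 \cdot 26 + 3 + 3 + 7 + 81) \notag \\
&= 226000 \text{ bits} \tag{4.4}
\end{align}
Proposed VSS architecture with No 2nd min accumulative sorter (Fig. 4.4):
\begin{align}
\mathrm{Reg}_{VNU} + \mathrm{Reg}_{CNU}
&= 0 + m \cdot (\text{local 1st min} + \text{global 1st min} + \text{Index} + \text{Sign}) \notag \\
&= 904 \cdot (3 + 3 + 5 \cdot 2 + 81) \notag \\
&= 87688 \text{ bits} \tag{4.5}
\end{align}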
Compared to the conventional Min-Sum two-stage pipelined architecture, the proposed architecture reduces the number of registers by 76.8%. Compared to the VSS architecture with the conventional accumulative sorter, the proposed architecture reduces the number of registers by 61.2% with some performance loss. Since G = 27, the reduction in the combinational circuits of the VNU is approximately 96%.
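The register counts and the two reduction percentages can be re-derived with a few lines of arithmetic. The sketch below only restates the numbers given in this section (N = 9153, m = 904, dv = 8, w = 4, 3-bit minimum values, 81 sign bits); the function and variable names are hypothetical.

```python
def register_bits():
    """Return (conventional, VSS with conventional sorter, proposed) in bits."""
    n, m, dv, w = 9153, 904, 8, 4
    min_bits, sign_bits = 3, 81

    # Two-stage pipelined Min-Sum: VNU messages plus per-row CNU state.
    conventional = n * dv * w + m * (min_bits + min_bits + 7 + sign_bits)

    # VSS with conventional accumulative sorter: 26 local 1st/2nd min pairs.
    vss_conventional = m * (
        min_bits * 26 + min_bits * 26 + min_bits + min_bits + 7 + sign_bits
    )

    # Proposed: one local and one global 1st min, two 5-bit indices.
    proposed = m * (min_bits + min_bits + 5 * 2 + sign_bits)
    return conventional, vss_conventional, proposed


def reduction_percent(before, after):
    """Percentage of registers saved when going from `before` to `after`."""
    return 100.0 * (1.0 - after / before)
```

Evaluating `register_bits()` gives 377872, 226000 and 87688 bits, and `reduction_percent` yields about 76.8% and 61.2% for the two comparisons.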
Chapter 5
Simulation and Implementation Result
5.1
Quantization
Belief Propagation (BP) is a probability-based message passing algorithm. When soft input is available, an LDPC code can provide powerful error correcting ability, and an LDPC code with 2-bit soft input can outperform a BCH code at the same code rate. An Additive White Gaussian Noise (AWGN) channel with Binary Phase Shift Keying (BPSK) modulation is used for demonstration and simulation. We assume that data ’0’ is mapped to ’1’ and data ’1’ is mapped to ’-1’. 2-bit quantization represents 4 levels. A bit with a channel value near 0 has a high probability of being an error bit. Therefore, a non-linear quantization is preferred. We use a threshold f to divide the channel value into 4 levels.
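[Figure: non-linear quantization of the channel value, with thresholds -f, 0 and f, output levels -Vmax, -Vmin, Vmin and Vmax, and BPSK points at -1 and 1.]
Fig. 5.2 shows the performance of the LDPC code with different parameters f, Vmin and Vmax.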
After non-linear quantization, the input LLR is 4 bits wide, as are the messages passed between CNUs and VNUs in the decoder. The decoding algorithm is the Normalized Min-Sum algorithm with scaling factor = 0.5.
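The four-level non-linear quantization can be sketched as a simple threshold function. The defaults below use the parameter set f = 0.35, Vmin = 0.5 and Vmax = 1.75 evaluated in this section; the function name and the treatment of values exactly on a threshold are assumptions made for illustration.

```python
def quantize_2bit(y, f=0.35, v_min=0.5, v_max=1.75):
    """Map a real channel value y to one of four levels.

    Values with |y| below the threshold f are unreliable and get the small
    magnitude v_min; all others get the large magnitude v_max. The sign of
    y is kept (data '0' is sent as +1, data '1' as -1).
    """
    magnitude = v_min if abs(y) < f else v_max
    return magnitude if y >= 0 else -magnitude
```

For example, channel values 0.2, -0.2, 0.9 and -1.3 map to 0.5, -0.5, 1.75 and -1.75 respectively.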
[Plot: BER (10^-6 to 10^-2) versus Eb/No (dB), 4.6 to 5.6; N=10^7, Iteration = 40. Curves: f=0.35 Vmin=0.50 Vmax=1.75; f=0.50 Vmin=0.50 Vmax=1.75; f=0.35 Vmin=0.75 Vmax=1.75; f=0.50 Vmin=0.75 Vmax=1.75; f=0.35 Vmin=1.00 Vmax=1.75; f=0.50 Vmin=1.00 Vmax=1.75.]
Figure 5.2: Performance of (9153, 8256) (Column deg = 8) LDPC code with different parameters.
The parameters f = 0.35, Vmin = 0.5 and Vmax = 1.75 provide the best performance.
In Fig. 5.3, the performance loss between floating-point input and 2-bit non-linear input quantization is 0.3dB. 2-bit non-linear input quantization can provide better performance than 4-bit linear input quantization. As more bits of input information require more READ operations on the NAND flash cell, the latency of reading data will increase. Therefore, 2-bit non-linear input quantization is chosen.
[Plot: BER (10^-6 to 10^-1) versus Eb/No (dB), 4 to 6; N=10^7, R=0.9, Iteration=40, (9153,8256). Curves: Soft Input, Floating; 2 bits non-linear Input, Q(4,2); 4 bits linear Input, Q(4,2); 4 bits linear Input, Q(4,1); 5 bits linear Input, Q(5,1); Hard Input, Q(4,2).]
Figure 5.3: Performance of LDPC code with different input quantization.
5.2
Performance
In Fig. 5.4, the 2-bit non-linear soft input LDPC code has a 0.7dB coding gain over the BCH code at BER = 10^-4. The 2-bit non-linear soft input LDPC code therefore has great potential to replace the BCH code in NAND flash memory systems. The simulation parameters of the LDPC code are 4-bit quantization (2-bit integer and 2-bit fraction) with scaling factor 0.5. The bit width of the messages passed between CNU and VNU is 4.
Not storing the global 2nd min value introduces a 0.1dB performance loss. However, the Variable-node-centric Sequential Scheduling (VSS) architecture with no 2nd min value reduces the register requirement, as computed in Section 4.5.
[Plot: BER (10^-6 to 10^-1) versus Eb/No (dB), 4 to 6; N=10^7, S=0.5, Iteration=40, R=0.9. Curves: (9153,8256), Soft Input, NMS, Floating; (9153,8256), 2 bits Soft Input, VSS w/o 2nd min, Q(4,2); (9153,8256), 2 bits Soft Input, VSS w 2nd min, Q(4,2); (9153,8256), 2 bits Soft Input, NMS, Q(4,2); (9153,8256), Hard Input, NMS, Q(4,2); (9032,8192), BCH code, t=60.]
Figure 5.4: Performance comparison, Iteration = 40.
5.3
Throughput
The gate count and critical path of the CNU and VNU after synthesis are listed in Table 5.1. The critical path of CNU + VNU is 5ns. We assume that the critical path of the control circuit is 2ns. Therefore the clock cycle is 7ns, and the LDPC decoder can operate at a frequency of 125MHz.
Table 5.1: Synthesis result of CNU and VNU with technology UMC90.
               CNU (sign bit register is not included)    VNU
Gate count     225                                         620
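The number of iterations is 10 and the clock frequency in Place and Route is 100MHz, so the cycle length is 10ns.
\[
\text{Throughput}
= \frac{\text{Information length}}{\text{Cycles per iteration} \cdot (\text{Number of iterations} + 1) \cdot \text{Cycle length}}
= \frac{8256}{27 \cdot (10 + 1) \cdot 10\,\text{ns}} \approx 2.78\ \text{Gbps}
\]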
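The throughput figure follows directly from the formula above, as this short check shows. The function name and argument names are hypothetical; the numbers are those stated in this section.

```python
def throughput_bps(info_bits, cycles_per_iteration, iterations, cycle_seconds):
    """Decoder throughput in bits per second.

    One extra iteration is added to the iteration count, as in the
    throughput formula of this section.
    """
    total_cycles = cycles_per_iteration * (iterations + 1)
    return info_bits / (total_cycles * cycle_seconds)


# 8256 information bits, 27 cycles per iteration, 10 iterations, 10 ns cycle.
gbps = throughput_bps(8256, 27, 10, 10e-9) / 1e9
```

With these values `gbps` is approximately 2.78, in line with the stated maximum throughput.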
5.4
Implementation Results
Table 5.2: Summary of implementation result (Place and Route).
                      Proposed LDPC Decoder
Technology            UMC 90nm 1P9M
Code Spec             (9153,8256)
Code Rate             0.9
Row Degree            81
Column Degree         8
Algorithm             Variable-node-centric Sequential Scheduling
Area                  4.82 mm2 (No IO Pad)
Gate Count            1100k
Iteration             10
Input Quantization    2 bits
Clock Frequency       100MHz
Maximum Throughput    2.78 Gbits/s
Power                 437 mW
Table 5.2 shows the post-layout result. The gate count after synthesis is 1100k and the core area is 4.82mm2 without IO pads. Using 90nm CMOS technology, the maximum throughput reaches 2.78 Gbps at an operating frequency of 100MHz with 10 iterations. Power consumption is 437mW.
Chapter 6
Conclusion and Future Work
6.1
Conclusion
This thesis proposes a (9153, 8256) LDPC code with code rate 0.9 for NAND flash memory systems. The (9153,8256) LDPC code is constructed by the permutation matrix algorithm, with column degree 8. Simulations show that the LDPC code with 2-bit soft input can outperform the BCH code at the same code rate. Therefore, the LDPC code is a good candidate to replace the BCH code in the next generation standard.
A high code rate LDPC code has a high row degree. This makes implementation difficult due to the large number of inputs to the sorter, and the routing complexity also increases. Variable-node-centric sequential scheduling (VSS) is a good solution to this problem. Variable nodes are divided into G groups, and the check node update procedure is processed in G cycles, reducing the number of inputs to the sorter. The CNU is further modified to reduce the hardware cost. Compared to the conventional Min-Sum two-stage pipelined architecture, the proposed architecture saves approximately 96% of the combinational circuits of the VNU and reduces the number of registers by 76.8%. The maximum throughput reaches 2.78 Gbps at an operating frequency of 100MHz with 10 iterations, using 90nm CMOS technology.
6.2
Future Work
Flash memory systems require a Bit Error Rate (BER) down to 10^-12, and this thesis proposes a high column degree LDPC code in order to suppress the error floor. Simulating BER down to 10^-12 takes years on a computer. Therefore, we will run simulations on FPGA to investigate the performance of the LDPC code down to 10^-12 in the future.
There is no standard flash memory channel model for simulation. A standard flash memory channel is therefore desired in order to compare the performance of different error correcting codes on flash memory. It is a new challenge, and more details about flash memory will be studied.
References
[1] D. M. Greg Atwood, Al Fazio and B. Reaves, “Intel StrataFlashTM Memory Technology Overview,” Intel Technology Journal, pp. 1–8, 4th Quarter 1997.
[2] R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error-correcting binary group codes,” Inform. and Contr, vol. 3, pp. 68–79, March 1960.
[3] A. Hocquenghem, “Codes correcteurs d’erreurs,” Chiffres, vol. 2, pp. 117–156, September 1959.
[4] W. J. Reid III, L. L. Joiner, and J. J. Komo, “Soft Decision Decoding of BCH Codes Using Error Magnitudes,” IEEE Int. Symp. on Info. Theory, p. 303, June 1997.
[5] Y. M. Lin, C. L. Chen, H. C. Chang, and C. Y. Lee, “A 26.9K 314.5Mbps Soft (32400, 32208) BCH Decoder Chip for DVB-S2 System,” in IEEE Asian Solid-State Circuits Conference, Nov. 2009, pp. 373–376.
[6] R. G. Gallager, “Low-Density Parity-Check Codes,” in MA: MIT Press, 1963.
[7] D. MacKay and R. Neal, “Near Shannon limit performance of low density parity check codes,” Electron. Lett, vol. 33, no. 6, pp. 457–458, March 1997.
[8] X.-Y. Hu, E. Eleftheriou, and D.-M. Arnold, “Progressive edge-growth Tanner graphs,” in Proc. IEEE Global Telecommunications Conf. (GLOBECOM), San Antonio, TX, Nov. 2001, pp. 995–1001.
[9] J. Zhang and M. Fossorier, “Shuffled iterative decoding,” IEEE Transactions on Communications, vol. 53, no. 2, pp. 209–213, Feb. 2005.
[10] C.-L. Chen, K.-S. Lin, H.-C. Chang, W.-C. Fang, and C.-Y. Lee, “A 11.5-Gbps LDPC Decoder Based on CP-PEG Code Construction,” in ESSCIRC, 2009, pp. 412–415.
[11] J. Sha, Z. Wang, M. Gao, and L. Lio, “Multi-Gb/s LDPC Code Design and Implementation,” IEEE Transactions on VLSI Systems, vol. 17, no. 2, pp. 262–268, Feb. 2009.
[14] H. Song, V. Kumar, and B.V.K., “Low-density parity check codes for partial response channels,” IEEE Signal Processing Magazine, pp. 56–66, Jan. 2004.
[15] M. Fossorier, “Quasicyclic low-density parity-check codes from circulant permutation matrices,” IEEE Trans. Inf. Theory, vol. 50, no. 8, pp. 1785–1793, Aug 2004.
[16] IEEE Std. 802.3an, Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications Std., 2006.
[17] D. Oh and K. Parhi, “Area Efficient Controller Design of Barrel Shifters for Reconfigurable LDPC Decoders,” IEEE International Symposium on Circuits and Systems, pp. 240–243, May 2008.
[18] C.-H. Liu, C.-C. Lin, H.-C. Chang, and Y. C.-Y. Lee, “Multi-Mode Message Passing Switch Networks Applied for QC-LDPC Decoder,” IEEE International Symposium on Circuits and Systems, vol. 18, no. 1, pp. 85–94, Jan 2010.
[19] D. Oh and K. Parhi, “Low-Complexity Switch Network for Reconfigurable LDPC Decoders,” IEEE Transactions on Very Large Scale Integration Systems, pp. 752–755, May 2008.
[20] J. Lin, Z. Wang, L. Li, J. Sha, and M. Gao, “Efficient Shuffle Network Architecture and Application for WiMAX LDPC Decoders,” IEEE Transactions on Circuits and Systems, vol. 56, no. 3, pp. 215–219, March 2009.