1. A 11.5-Gbps LDPC Decoder Based on CP-PEG Code Construction
Low-density parity-check (LDPC) codes are well-known error-control codes with near-Shannon-limit performance [A-1], and each code can be described by its parity-check matrix H. The order in which messages are exchanged between nodes is called the schedule, and it influences the convergence speed of the decoding algorithm. In the standard belief propagation (BP) algorithm, all check node messages or all variable node messages are updated simultaneously, which is known as flooding scheduling. Alternatively, the layered BP algorithm [A-3], which updates messages along successive check node groups, is a form of check-node-centric sequential scheduling (CSS). CSS has been shown to reduce the maximum number of iterations to approximately half that of standard BP with similar performance.
Recently, LDPC codes adopted in high-throughput systems tend to have high code rates to increase channel efficiency. However, the resulting large check node degree dc makes implementation difficult. Even though CSS reduces the number of iterations, the throughput is still degraded by the long critical path of the check node unit (CNU).
In this project, the proposed decoder aims to provide a high-throughput and hardware-efficient solution for high-code-rate LDPC codes with large check node degrees. To reduce the number of iterations, the decoding schedule is based on variable-node-centric sequential scheduling (VSS, also known as shuffled decoding [A-6]), in which messages are updated along successive variable node groups. Because the inputs of the CNU operation are also divided into several subgroups, the complexity and critical path delay of the CNU are reduced. Furthermore, a single pipelined architecture and a modified CNU are proposed to minimize the message storage memory. Using a (2048, 1920) LDPC code constructed by the circulant permutation progressive edge-growth (CP-PEG) algorithm [A-7] as a design example, the decoder chip implemented in 90 nm technology demonstrates advantages in throughput, energy efficiency, and hardware efficiency.
2. A 2.37-Gb/s 284.8mW Rate-Compatible (491,3,6) LDPC-CC Decoder
LDPC convolutional codes (LDPC-CCs) were proposed in 1999 [A-10], around the time LDPC block codes (LDPC-BCs) were rediscovered. LDPC-CCs have characteristics of convolutional codes that LDPC-BCs lack. Continuous encoding supports input data streams of any length, which is especially suitable for streaming video and packet-switched networks. The puncturing scheme applied to LDPC-CCs provides flexible code rates by discarding encoded bits at certain positions according to a puncture table.
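The puncturing idea can be illustrated with a minimal sketch. The pattern, the bit values, and the function names below are hypothetical, and a rate-1/2 mother code is assumed only for the arithmetic; the actual puncture tables of the decoder are not reproduced here.

```python
from fractions import Fraction


def puncture(bits, pattern):
    """Drop encoded bits wherever the repeating puncture pattern holds a 0."""
    return [b for k, b in enumerate(bits) if pattern[k % len(pattern)]]


def punctured_rate(mother_rate, pattern):
    """Code rate after puncturing a mother code with the given pattern."""
    kept = sum(pattern)
    return mother_rate * len(pattern) / kept


# Hypothetical pattern: transmit 3 of every 4 encoded bits.
pattern = [1, 1, 1, 0]
encoded = [1, 0, 1, 1, 0, 0, 1, 0]
sent = puncture(encoded, pattern)                 # positions 3 and 7 are discarded
rate = punctured_rate(Fraction(1, 2), pattern)    # assumed rate-1/2 mother code
```

With this pattern, the assumed rate-1/2 mother code becomes a rate-2/3 code; the decoder would treat the discarded positions as erasures.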
The simple encoder circuitry, composed of registers, multiplexers, and a few XOR gates, has low hardware cost and power consumption and can be used in distributed sensor networks.
Furthermore, the correlation between codeword symbols of an LDPC-CC is limited to a specific interval (the constraint length ms+1, where ms is the memory size of the encoder). This locality property lowers the overall routing complexity of the decoder. Despite these advantages, LDPC-CCs have rarely been adopted by standards. The main reasons are long decoding latency, high power consumption, and low-to-moderate decoding throughput.
The throughput of LDPC-CC decoders reported in the literature is only several hundred Mbps, which can hardly compete with LDPC-BC decoders that reach several tens of Gbps. The lower throughput can be explained by the decoder structure. An LDPC-CC decoder consists of serially concatenated processors, and each processor, which decodes a sliding window on the trellis diagram, can be regarded as one iteration of LDPC-BC decoding. Increasing the number of processors enhances the error-correcting capability but does not increase the throughput. Therefore, many works have focused on parallel message passing: an analysis of parallelization concepts in [A-11], a single-instruction-multiple-data (SIMD) architecture in [A-12], and joint code-decoder design in [A-13]. Recently, a high-throughput LDPC-CC decoder was proposed that adds regularity during code construction [A-14]. However, achieving high throughput remains challenging for time-varying LDPC-CCs without such regularity.
B. BCH and RS Decoder
1. Soft BCH Decoder Chip for DVB-S2 System
The Bose-Chaudhuri-Hocquenghem (BCH) codes [B-1] have recently become popular in storage and communication systems. In the DMB-T [B-2] and DVB-S2 [B-3] applications shown in Fig. 1, BCH codes with long block lengths are specified to suppress the error floor that results from iterative LDPC decoding. Since the BCH codes serve as outer codes in those communication systems, the soft information from the inner decoder can be employed to further improve the error-correcting performance.
Fig. 1. Block diagram of DMB-T and DVB-S2 systems
Soft-decision decoding of BCH codes has attracted considerable research interest. Forney developed generalized-minimum-distance (GMD) decoding [B-4], which uses algebraic algorithms to generate a list of candidate codewords and chooses the most likely codeword from the list. Other algorithms based on the same candidate-list concept, such as Chase [B-5] and SEW [B-6], are also widely used in many applications. This report presents a soft BCH decoding method that uses error magnitudes [B-7] to deal with the least reliable bits. For example, Fig. 2 shows the results of a concatenated code with 16-state BCJR [B-8] and BCH (255,239) over GF($2^8$). Based on the soft information from the preceding decoder, the performance gain of the BCH decoder is about 0.73 dB at BER = $10^{-6}$ when 2t+1 candidate bits within a codeword are chosen to correct errors.
Fig. 2. Simulation results for BCH (255,239) concatenating with a 16-state BCJR under BPSK modulation and AWGN channel
2. An Improved Soft BCH Decoder with One Extra Error Compensation
The Bose-Chaudhuri-Hocquenghem (BCH) codes [B-1] are popular in storage and communication systems, such as flash devices and the DMB-T [B-2] and DVB-S2 [B-3] broadcasting systems. Recently, soft decoding of BCH codes has attracted considerable research interest. Forney developed generalized-minimum-distance (GMD) decoding [B-4] to generate a list of candidate codewords and choose the most likely codeword from the list. Other algorithms based on a similar concept, such as Chase [B-5] and SEW [B-6], are also widely used in many applications.
Moreover, Therattil and Thangaraj provided a sub-optimum MAP BCH decoding method with Hamming SISO decoder in 2005 [B-12].
In general, decoding an entire codeword with a soft BCH decoder is much more complex than with a hard BCH decoder. Nevertheless, soft BCH decoders with lower complexity can be obtained by focusing on the least reliable bits instead of the whole codeword. A soft BCH decoding algorithm that uses error magnitudes to deal with the least reliable bits was developed in 1997 [B-7]. However, Fig. 3 shows a performance loss of about 0.25 dB at BER = $10^{-5}$ in the AWGN channel compared with the hard-decision BCH decoder for the BCH (255,239) code. Existing soft-decision algorithms thus provide either better error-correcting performance or lower hardware complexity than a traditional hard BCH decoder. In this project, we present a soft BCH decoder that follows a concept similar to [B-7] and improves the correcting performance by compensating one extra error while maintaining low hardware complexity.
Fig. 3. Simulation Result for BCH (255,239)
3. Soft RS Decoder Chip for Optical Communication System
Reed-Solomon (RS) codes are widely used in communication and digital data storage systems because they can correct burst errors. According to the International Telecommunication Union (ITU-T) recommendations, RS (255,239) is standardized for high-speed optical fiber systems and Gigabit Passive Optical Network (GPON) applications, which demand 2.5 Gb/s throughput to achieve 10-40 Gb/s with 16 RS decoders or to satisfy the maximum uplink and downlink requirement. To resist the increasing noise induced by the higher transmission rates required in optical communication systems, soft RS decoding algorithms are exploited to achieve considerable performance gain. These algorithms modify the received sequence to form a list of candidate codewords and choose the most probable one. Nevertheless, because of their high computational complexity, these soft decoding algorithms are still unsuitable for practical implementation. In this project, a decision-confined soft decoding algorithm is proposed to enhance performance while maintaining area efficiency. Instead of decoding all the candidate codewords as other soft decoding algorithms do, our design decodes only the candidate codewords whose error-locator polynomial Λ(x) has a degree less than the error correction capability t. A Gray-code-based bit-flipping method is also exploited so that only one set of hardware is required.
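The reason a Gray code ordering suits bit flipping can be sketched as follows: consecutive test patterns differ in exactly one position, so a single decoding unit can move from one candidate to the next by flipping one bit. The sketch below only enumerates which bit changes at each step; the function names are hypothetical, and how the chip applies the flips to the received sequence is not described in this section.

```python
def gray_code(k):
    """k-th word of the reflected binary Gray code."""
    return k ^ (k >> 1)


def flip_sequence(num_bits):
    """Index of the single bit that changes at each step through all test patterns."""
    flips = []
    for k in range(1, 1 << num_bits):
        changed = gray_code(k) ^ gray_code(k - 1)   # exactly one bit is set
        flips.append(changed.bit_length() - 1)
    return flips


# Three least reliable positions give 2^3 = 8 candidate patterns and 7 single-bit steps.
steps = flip_sequence(3)
```

Every pattern over the chosen positions is visited once, and each step touches a single position, which is consistent with needing only one set of hardware.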
C. Viterbi Decoder
1. A Low-Power Viterbi Decoder Based on Scarce State Transition and Variable Truncation Length
The Viterbi decoder, which implements the Viterbi algorithm [C-1] for decoding convolutional codes, is composed of three main blocks: the branch metric (BM) unit, the add-compare-select (ACS) unit, and the survivor memory. The BM unit generates branch metrics from the input data. The ACS unit recursively accumulates branch metrics as path metrics (PM) and makes decisions to select the most likely state transition sequences, or survivors. The survivor memory stores the survivors for retrieving the data sequence.
There are two well-known survivor memory management approaches: register-exchange (RE) and trace-back (TB) [C-2]. Register-exchange is conceptually the simplest technique and eliminates repeated memory access operations. Therefore, this approach has shorter latency and is suitable for high-speed decoder implementations. However, because of the data movement among registers, it is considered power inefficient.
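The three blocks and the register-exchange approach can be sketched in software as follows. This is a minimal hard-decision illustration that assumes a 4-state rate-1/2 code with generators (7, 5) in octal; the code, the function names, and the message are assumptions for illustration and do not describe the MB-OFDM UWB decoder.

```python
def encode(bits):
    """Rate-1/2 convolutional encoder, generators (7, 5) octal (assumed example code)."""
    s, out = 0, []
    for u in bits:
        b1, b2 = s >> 1, s & 1                 # previous two input bits
        out.append((u ^ b1 ^ b2, u ^ b2))
        s = (u << 1) | b1
    return out


def viterbi_re(received):
    """Hard-decision Viterbi decoder with register-exchange survivor memory."""
    INF = float("inf")
    pm = [0, INF, INF, INF]                    # path metrics, encoder starts in state 0
    regs = [[] for _ in range(4)]              # one survivor register per state
    for r in received:
        new_pm, new_regs = [INF] * 4, [None] * 4
        for s in range(4):
            if pm[s] == INF:
                continue
            b1, b2 = s >> 1, s & 1
            for u in (0, 1):
                c = (u ^ b1 ^ b2, u ^ b2)
                bm = (c[0] != r[0]) + (c[1] != r[1])   # BM unit: Hamming distance
                nxt = (u << 1) | b1
                cand = pm[s] + bm                       # ACS: add
                if cand < new_pm[nxt]:                  # ACS: compare and select
                    new_pm[nxt] = cand
                    new_regs[nxt] = regs[s] + [u]       # register exchange: copy survivor
        pm, regs = new_pm, new_regs
    best = min(range(4), key=lambda s: pm[s])
    return regs[best]


msg = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
decoded = viterbi_re(encode(msg))
```

Copying a whole survivor register at every step is the data movement that makes register-exchange fast to read out but costly in power.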
Fig. 4 shows the conventional $2^{\upsilon}$-state Viterbi decoder with the register-exchange architecture [C-3]. The decisions from the ACS units are shifted through the survivor memory from left to right. Applying the scarce state transition (SST) technique and a variable truncation length, we illustrate the proposed low-power Viterbi decoder for the MB-OFDM UWB system [C-4] in Fig. 5. The SST unit is integrated to reduce state transition activity, leading to lower dynamic power consumption [C-8]. Furthermore, the path merging detector monitors the merging point of all survivors and adjusts the truncation length to avoid unnecessary data movement in the registers. Many redundant operations in the survivor memory can thus be eliminated to save power. Additionally, considering the high throughput requirement, the radix-4 ACS structure is adopted because it offers a better compromise between cost and throughput [C-5].
Fig. 4. The conventional register-exchange architecture
Fig. 5. The proposed Viterbi Decoder Architecture
For the Viterbi decoder, the scarce state transition (SST) algorithm is a low-power technique that significantly reduces state transition activity under high-SNR conditions [C-6]-[C-8]. In the conventional Viterbi decoding model (see Fig. 6(A)), u(D) denotes the information sequence, and C(D) = u(D)G(D) is the codeword sequence derived from the generator polynomial G(D). From the received sequence r(D), the Viterbi decoder estimates the decoded information o(D).
The SST Viterbi decoding architecture in Fig. 6(B) includes two additional blocks: a pre-decoder and a re-encoder. Assume

$$ r(D) = u(D)\,G(D) + e(D) = C(D) + e(D) \tag{1} $$

where e(D) is the error sequence from a noisy channel. The pre-decoder directly decodes the information sequence from r(D):

$$ i(D) = r(D)\,G^{-1}(D) = \hat{u}(D) \tag{2} $$

The re-encoder then encodes i(D) into a new codeword z(D):

$$ z(D) = i(D)\,G(D) = \hat{C}(D) \tag{3} $$

The Viterbi decoder performs maximum likelihood decoding on y(D), which is defined as follows:

$$ y(D) = r(D) + z(D) = C(D) + e(D) + \hat{C}(D) \tag{4} $$

In high-SNR conditions, e(D) is nearly zero, and the decoded information sequence becomes

$$ o(D) = i(D) + n(D) = \hat{u}(D) + n(D) \tag{5} $$

If the channel condition is good enough, the decoder estimates an approximately zero sequence; as a result, the dynamic power decreases as the channel improves.
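Equations (1) to (4) can be checked numerically with a small hard-decision sketch. It assumes the rate-1/2 code G(D) = [1 + D + D^2, 1 + D^2], for which [D, 1 + D] is a polynomial right inverse; this code, the helper names, and the test sequences are assumptions for illustration, and polynomials over GF(2) are stored as Python integers.

```python
def gf2_mul(a, b):
    """Multiply two GF(2) polynomials stored as ints (bit k is the D^k coefficient)."""
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    return prod


G1, G2 = 0b111, 0b101          # assumed G(D) = [1 + D + D^2, 1 + D^2]
INV1, INV2 = 0b010, 0b011      # right inverse [D, 1 + D]: D*G1 + (1 + D)*G2 = 1


def sst_front_end(r1, r2):
    """Return the pre-decoded sequence i(D) and the Viterbi input y(D)."""
    i = gf2_mul(r1, INV1) ^ gf2_mul(r2, INV2)   # eq. (2): i(D) = r(D) G^{-1}(D)
    z1, z2 = gf2_mul(i, G1), gf2_mul(i, G2)     # eq. (3): z(D) = i(D) G(D)
    return i, (r1 ^ z1, r2 ^ z2)                # eq. (4): y(D) = r(D) + z(D)


u = 0b1011001
c1, c2 = gf2_mul(u, G1), gf2_mul(u, G2)         # eq. (1) with e(D) = 0
i_clean, y_clean = sst_front_end(c1, c2)        # i(D) equals u(D), y(D) is all zero

e1 = 1 << 3                                     # one channel error on the first output
i_err, y_err = sst_front_end(c1 ^ e1, c2)       # y(D) now depends only on e(D)
```

In this sketch y(D) is all zero when e(D) = 0 and stays sparse for a single error, which is the property that keeps the Viterbi decoder near the all-zero state at high SNR.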
Fig. 6. (A) Conventional model (B) SST decoding model
2. A Low Power Differential Cascode Voltage Switch with Pass Gate Pulsed Latch for Viterbi Decoder
In mobile communication systems, especially wireless local area networks (WLANs), information must be transmitted at high data rates. Furthermore, an efficient error-control code is commonly adopted to enhance system performance. Accordingly, convolutional codes have been used extensively in communication systems because they provide superior error-correction capability while maintaining reasonable coding complexity. The Viterbi algorithm is one of the best algorithms for decoding convolutional codes with modest computing resources. However, as data rates increase, the power dissipation and system complexity also increase. Moreover, as the required transmission rates of wireless systems increase, the error-control mechanism has come to dominate power dissipation.
Fig. 7 presents the power distribution of a Viterbi decoder that we proposed in [C-10], implemented in UMC 90 nm CMOS technology. The survivor memory unit (SMU) is constructed from a register array and dissipates most of the power in a Viterbi decoder. Therefore, high performance, low power consumption, and robustness are the basic requirements for the design of clocked storage elements in a Viterbi decoder. This project presents a low-power differential cascode voltage switch with pass-gate (DCVSPG) pulsed latch for the Viterbi decoder.
Fig. 7. Power distributions of a Viterbi Decoder
III. Research Methods and Results
A. LDPC Decoder
1. A 11.5-Gbps LDPC Decoder Based on CP-PEG Code Construction
a) CODE STRUCTURE AND DECODING ALGORITHM
(1) CP-PEG LDPC Code Construction
The (2048, 1920) irregular LDPC code of rate 15/16 used in this project was constructed by the CP-PEG algorithm and is shown in Fig. 8(a). The constructed parity-check matrix H consists of p×p circulant permutation (CP) matrices and all-zero matrices. A CP matrix is a cyclic square matrix with constant row and column weight of one. The number in each CP matrix indicates its cyclic shift amount, and -1 denotes an all-zero matrix. With p = 32, there are 4p check nodes and 64p variable nodes in the bipartite graph, where every check node has degree 46, and 16p, 24p, and 24p variable nodes have degrees 4, 3, and 2, respectively. This code was shown to perform better than other PEG-based LDPC codes [A-7]; nevertheless, the high check node degree requires a suitable decoder architecture to overcome the implementation difficulties.
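The block structure described above can be sketched with NumPy. The small base matrix below is hypothetical (it is not the matrix of Fig. 8(a)), the direction of the cyclic shift is an assumption, and the last lines only re-check the node and edge counts quoted in the text.

```python
import numpy as np


def cp_matrix(shift, p):
    """p x p circulant permutation matrix: identity with columns cyclically shifted."""
    if shift < 0:                       # -1 marks an all-zero block
        return np.zeros((p, p), dtype=np.uint8)
    return np.roll(np.eye(p, dtype=np.uint8), shift, axis=1)


def expand(base, p):
    """Expand a base matrix of shift values into the binary parity-check matrix H."""
    return np.block([[cp_matrix(s, p) for s in row] for row in base])


# Hypothetical 2 x 3 base matrix with p = 4, for illustration only.
base = [[0, 2, -1],
        [1, -1, 3]]
H = expand(base, p=4)

# Counts quoted in the text for p = 32.
p = 32
n_var, n_chk = 64 * p, 4 * p                         # 2048 variable nodes, 128 check nodes
chk_edges = 4 * 46                                   # edges per unit of p, check node side
var_edges = 16 * 4 + 24 * 3 + 24 * 2                 # edges per unit of p, variable node side
```

Both edge counts equal 184 per unit of p, so the quoted degree distribution is self-consistent, and 2048 - 128 = 1920 matches the code dimension if H has full rank.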
(2) Variable-node centric Sequential Scheduling
In the VSS approach, the initialization, stopping criterion test, and output steps remain the same as in the standard BP algorithm. The only difference between the two algorithms lies in the updating procedure. The normalized min-sum (NMS) algorithm, which compensates for the approximation error in the check node, is shown and described on the next page.
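The check node update of the NMS algorithm can be sketched as follows. This is a minimal floating-point illustration of the standard normalized min-sum rule, not the fixed-point datapath of the decoder; the function name and the example messages are assumptions, and 0.75 is the scaling factor quoted for the simulations in this report.

```python
def nms_check_update(v2c, alpha=0.75):
    """Normalized min-sum update for one check node.

    v2c   : variable-to-check messages (LLRs) on the edges of the check node
    alpha : normalization (scaling) factor applied to the min-sum magnitude
    """
    mags = [abs(x) for x in v2c]
    idx_min = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx_min]
    min2 = min(m for k, m in enumerate(mags) if k != idx_min)
    sign_all = 1
    for x in v2c:
        if x < 0:
            sign_all = -sign_all
    out = []
    for k, x in enumerate(v2c):
        mag = min2 if k == idx_min else min1        # exclude the edge's own magnitude
        sign = sign_all * (-1 if x < 0 else 1)      # exclude the edge's own sign
        out.append(alpha * sign * mag)
    return out


c2v = nms_check_update([2.0, -1.0, 4.0, -3.0])
```

Only the minimum, the second minimum, the index of the minimum, and the signs are needed to produce every outgoing message, which is why a sorter forms the core of the CNU.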
In this work, the codeword is divided into G = 4 groups; therefore, the parity-check matrix H is divided into 4 sub-matrices (H1 to H4). As shown in Fig. 8(b), each sub-matrix contains an equal number of variable nodes of each degree to reduce the hardware cost of the variable node unit (VNU). Moreover, the sub-matrices with the same shift amounts (the blue-shaded CP matrices) are arranged in the same positions so that the routing and control can be further simplified.
Fig. 8. Parity-Check Matrix of (2048,1920) LDPC Code
b) PROPOSED DECODER ARCHITECTURE
In this section, the complete decoder architecture will be presented, including data path, scheduling, and VLSI structure of CNU and modified CNU.
(1) Single Pipelined Architecture
The entire decoder depicted in Fig. 9(a) is composed of fully parallel CNUs and partially parallel VNUs, where VNU2, VNU3, and VNU4 handle the variable node operations of degree 2, 3, and 4, respectively. Let the sorted messages sent from the variable nodes in the g-th group to one specific check node at the i-th iteration be given by:
Then the magnitude part of the check node to variable node message in (1) can be computed by the following equation:
Fig. 9(b) shows the timing diagram of the proposed decoder. G initialization cycles are required to calculate the sorted messages of every group g, for 0 ≤ g ≤ G − 1. Since only one subgroup of the messages $z_{mn}^{(i)}$ is updated in the g-th cycle of an iteration, the main operation of the CNU can be simplified to calculating the sorted messages of group g (local sorting) in each cycle and then performing global sorting as in equation (5).
In the proposed single pipelined architecture, only the sorted messages of each group and the check node to variable node messages are stored. In the NMS algorithm, the sorted results can be represented by the min value, the second min value, and the index of the min value. Therefore, the proposed decoder latches only two values, one index, and the sign part of the messages in each subgroup, while the variable node to check node message $z_{mn}^{(i)}$ is calculated on the fly. The single pipelined architecture is feasible because the CNU can be updated immediately after the VNU operations in the VSS approach.
Fig. 9. Proposed Architecture and Scheduling
(2) Modified CNU
The check node to variable node update can be divided into a magnitude part and a sign part. Fig. 10(a) illustrates the magnitude part of the CNU, which is an accumulative sorter composed of a local sorter and a global sorter. The local sorter finds the local min and second min values in each subgroup, and the global sorter finds the global min and second min values of a check node. Similarly, the sign operation can be computed accumulatively, in the same way as the accumulative sorter.
For our proposed CP-PEG LDPC code with dc = 46, the VSS approach with G = 4 reduces the 46 sorter inputs to only 12 inputs. A larger number of subgroups G results in fewer sorter inputs but increases the storage for the min, second min, and index values of each subgroup.
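The local and global sorting steps can be sketched in software as follows. The split of the 46 inputs into subgroups of 12, 12, 12, and 10 is an assumption for illustration (the real partition follows the structure of H), and the function names and magnitudes are hypothetical.

```python
def local_sort(mags, offset):
    """Local sorter: (min, second min, index of min) for one subgroup."""
    order = sorted(range(len(mags)), key=mags.__getitem__)
    return mags[order[0]], mags[order[1]], offset + order[0]


def global_sort(local_results):
    """Global sorter: merge per-subgroup results into the check node's result."""
    min1, min2, idx = float("inf"), float("inf"), -1
    for m1, m2, k in local_results:
        if m1 < min1:
            # New global min; the old min competes with the subgroup's second min.
            min1, min2, idx = m1, min(min1, m2), k
        else:
            min2 = min(min2, m1)
    return min1, min2, idx


# 46 distinct hypothetical magnitudes for a check node with dc = 46.
mags = [(7 * k) % 47 + 1 for k in range(46)]
groups = [(mags[s:s + 12], s) for s in range(0, 46, 12)]     # sizes 12, 12, 12, 10
result = global_sort([local_sort(g, s) for g, s in groups])
```

Each subgroup contributes only three stored quantities, so the storage grows with G while the width of each local sorter shrinks.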
To further reduce the storage overhead of each subgroup, we propose a reduced-storage accumulative sorter, shown in Fig. 10(b). The basic idea is to merge the local min and local second min of G − 1 subgroups into one group. Some extra control circuits are needed to open or close the feedback loop in Fig. 10(b). This sorter architecture is beneficial because the complexity saved in storage registers and global sorters exceeds the overhead of the control circuits. Section IV will show that the performance of this modified CNU is similar to that of the original CNU.
(3) Summary
In a traditional two-stage pipelined architecture, both the variable node to check node messages $z_{mn}^{(i)}$ and the check node to variable node messages are kept in registers or memory. Assume the bit-width of the messages is w (= 6) and the variable node degree is dv; then the required memory size (or number of registers) is as follows:
For the proposed single pipelined decoder and the modified CNU in Fig. 10(b), the memory size is reduced to
Therefore, the overall register reduction of the proposed architecture is 73%, which brings the following advantages: fewer registers, higher utilization of functional units, and reduced complexity. Since high-rate LDPC codes usually have more VNUs than CNUs (in our case, 512 VNUs and 128 CNUs), eliminating the registers from the VNUs to the CNUs not only reduces hardware cost but also lowers the power consumption of the clock tree.
Fig. 10. CNU Architecture
c) PERFORMANCE AND IMPLEMENTATION RESULTS
Under an AWGN channel with BPSK modulation, performance curves are simulated to determine the required bit-width and maximum number of iterations. The simulation parameters of the proposed algorithm are 6-bit input quantization (5-bit integer part and 1-bit fractional part), a scaling factor of 0.75 for the NMS algorithm, and 4 or 5 iterations. In Fig. 11, the bit-error rate (BER) curves indicate that 4 iterations of the proposed algorithm are sufficient to achieve performance similar to that of the standard BP algorithm with 7 iterations. Furthermore, at almost the same code rate, our CP-PEG LDPC code outperforms the (255, 239) RS code by 1.2 dB at BER = $10^{-5}$, which reveals the potential of CP-PEG LDPC codes for storage applications and fiber-optic communication systems. The overall SNR loss between this work and the Shannon limit is only 1.6 dB.
The proposed LDPC decoder is implemented with a standard-cell design flow and fabricated in 90-nm 1P9M CMOS technology. The core occupies 3.84 mm$^2$ with 68% utilization. The die photo is shown in Fig. 12, where the placement of the CNUs and VNUs is determined automatically by the APR tool. Since decoding one LDPC codeword requires 4 initialization cycles plus 4 iterations, the throughput is (1920 bits / 20 cycles) × frequency. Fig. 13 shows the measured maximum throughput and power dissipation under different SNR conditions and supply voltages. The measurement results indicate that the test chip at the FF corner can achieve 11.5 Gbps throughput under a 1.4 V supply voltage. The throughput can be scaled down to 5.77 Gbps with a 0.8 V supply voltage to meet the throughput requirement of the IEEE 802.15.3c standard, and the energy efficiency will be 0.033 nJ/bit.
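The throughput expression can be worked through numerically. The sketch below assumes that each iteration takes G = 4 cycles (one per variable node group), which is consistent with the 20 cycles quoted above; the clock frequency and power it derives are implied values computed from the quoted figures, not measurements, and the names are hypothetical.

```python
INFO_BITS = 1920          # information bits per (2048, 1920) codeword
G = 4                     # variable node groups, assumed one group per cycle
ITERATIONS = 4
INIT_CYCLES = G           # G initialization cycles

cycles = INIT_CYCLES + ITERATIONS * G          # 4 + 4 * 4 = 20 cycles per codeword


def throughput_bps(clock_hz):
    """Information throughput for a given clock frequency."""
    return INFO_BITS / cycles * clock_hz


def clock_for(throughput):
    """Clock frequency implied by a target information throughput."""
    return throughput * cycles / INFO_BITS


f_max = clock_for(11.5e9)       # roughly 119.8 MHz for 11.5 Gbps
f_low = clock_for(5.77e9)       # roughly 60.1 MHz for 5.77 Gbps
power_w = 0.033e-9 * 5.77e9     # energy per bit times bit rate, roughly 0.19 W
```

Because 96 information bits are delivered per clock cycle, a clock of only about 120 MHz is enough for the quoted 11.5 Gbps.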
Compared with the state-of-the-art designs in Table 1, the proposed LDPC decoder outperforms the others in throughput, hardware efficiency, and power efficiency. Since the LDPC code specifications of these designs differ, the SNR loss of each work relative to its Shannon limit is also listed for reference.
Fig. 11. Performance
d) CONCLUSION
A high-throughput and power-efficient LDPC decoder is presented. By exploiting variable-node-centric sequential scheduling, the proposed decoding algorithm reduces the maximum number of iterations without performance loss. In addition, the single pipelined architecture and the modified CNU save 73% of the message storage memory and decrease the sorter size, resulting in a low-complexity design. After implementation in 90 nm technology, the test chip occupies 3.84 mm$^2$ and supports a maximum data rate of 11.5 Gbps under 1.4V